
CUDA Debug/Performance TODOS #12

Open · 4 of 9 tasks
odarbelaeze opened this issue Jul 29, 2016 · 31 comments

@odarbelaeze
Contributor

odarbelaeze commented Jul 29, 2016

  • Verify that the materials with zero atoms are properly handled in CUDA stats.
  • Limit the cu::grid_size variable at initialisation (according to the plan)
  • Add const and __restrict__ qualifiers to read-only arrays in kernel calls for better caching (see the sketch after this list)
  • Add support for tensorial exchange
  • Move to cusp for vector-matrix multiplication with DIA format
  • Check register usage in Heun scheme - may not be too bad
  • CuSPARSE calls are asynchronous - overlap with spin fields at the cost of extra storage? Also need to check for synchronisation before Heun integration
  • Remove a bunch of intermediate variables from external field calculation - might help performance
  • The discrimination towards energy as a stat in GPU needs to stop #stopTheDiscrimination

That's a start; let's see what else comes up and record it here.
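
For the restrict/const item, something like this is what I have in mind (the kernel and array names here are made up, not the actual vampire kernels):

// Marking read-only kernel arguments as const ... __restrict__ lets the
// compiler assume no aliasing and route the loads through the read-only cache.
__global__ void update_fields(
      const double * __restrict__ sx,  // read-only spin components
      const double * __restrict__ sy,
      const double * __restrict__ sz,
      double * hx,                     // output fields
      double * hy,
      double * hz,
      int n_atoms)
{
   for (int i = blockIdx.x * blockDim.x + threadIdx.x;
        i < n_atoms;
        i += gridDim.x * blockDim.x)
   {
      // placeholder body - the real kernels accumulate the field terms here
      hx[i] = sx[i];
      hy[i] = sy[i];
      hz[i] = sz[i];
   }
}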

@richard-evans
Owner

richard-evans commented Jul 30, 2016

  • Try a symmetric matrix calculation to improve sparse matrix performance, but this requires data preconditioning.
  • Fix a few type casts to speed up SP performance, e.g. sqrtf, 1.0f, etc. (see the sketch below).
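
Roughly the kind of change meant in the second point (the function itself is made up; the sqrt vs sqrtf and 1.0 vs 1.0f distinction is what matters):

// Double-precision literals and maths calls silently promote the whole
// expression to double, which is much slower on most GPUs.
__device__ float inv_norm_slow(float x, float y, float z)
{
   return 1.0 / sqrt(x * x + y * y + z * z);    // 1.0 and sqrt() are double
}

__device__ float inv_norm_fast(float x, float y, float z)
{
   return 1.0f / sqrtf(x * x + y * y + z * z);  // stays in single precision
   // (rsqrtf(...) would also avoid the division entirely)
}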

odarbelaeze changed the title from CUDA Debug TODOS to CUDA Debug/Performance TODOS on Aug 5, 2016
@odarbelaeze
Contributor Author

@richard-evans according to this:

Jxx_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[0]);
Jyy_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[1]);
Jzz_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[2]);

I can guess the xx, yy, and zz values of the exchange for the vector case, but what would be the indices for xy, xz, etc. for the tensor case?

@mattoaellis
Collaborator

mattoaellis commented Aug 5, 2016

IIRC there are separate exchange lists for the isotropic, vector and tensor cases. So for xy, xz, ... you probably want something like:
Jxy_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[0][1]);

I would need to check the src files though.
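
Following that pattern, my guess at the full set for the tensorial case would be something like this (assuming Jij is the 3x3 tensor in t_exchange_list; the Jxy_vals_h etc. containers are hypothetical and I haven't checked this against the source):

// diagonal components
Jxx_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[0][0]);
Jyy_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[1][1]);
Jzz_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[2][2]);
// off-diagonal components
Jxy_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[0][1]);
Jxz_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[0][2]);
Jyx_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[1][0]);
Jyz_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[1][2]);
Jzx_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[2][0]);
Jzy_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[2][1]);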

@odarbelaeze
Contributor Author

@mattoaellis since you're here, did you try CUSP for the matrix multiplication? I'm looking at this at the moment: http://cusplibrary.github.io/classcusp_1_1csr__matrix__view.html#a2a5a8f5d37b23c657002ad4c31509688

@mattoaellis
Collaborator

Yeah, I've used CUSP in the past for matrix multiplication. It's pretty neat for setting things up and for easy calculations. It has both the diagonal (DIA) and ELLPACK (ELL) formats, which are both useful for the exchange calculation.
I thought we decided against it because it isn't part of the main CUDA toolkit and would require people to download the library themselves?
Another downside is that (unless they have updated it) they don't have any symmetric solvers, since they claim the transpose would not be efficient enough.

@odarbelaeze
Contributor Author

Hmm, it seems to be available on the clusters, so no extra downloads required. Furthermore, DIA seems to be the way to go, and it's very easy to do the format conversions with this.

@mattoaellis
Collaborator

mattoaellis commented Aug 5, 2016

I think that was because we were using it before. Out of interest, what version is it? I would certainly recommend it, and downloading an extra library is not that hard, but I don't think we can count on it being pre-installed on every cluster.
DIA is likely going to be the best, but not always, as it heavily depends on the numerical ordering of the spins. ELL is a bit more versatile in that respect, as it only depends on having a constant number of values per row, which we are likely to have unless it is some strange exchange. I suggest setting it up as a typedef within a macro so that we can easily switch between them if we go for CUSP:

#ifdef __DIAEXCH
typedef cusp::dia_matrix<int, double, cusp::device_memory> exch_matrix_t;
#else
typedef cusp::ell_matrix<int, double, cusp::device_memory> exch_matrix_t;
#endif

In CUSP it is super easy to set up the matrix on the host in a simple format, i.e. CSR, which I think the serial version uses anyway, and then just convert to our chosen format.

cusp::csr_matrix<int, double, cusp::host_memory> exchange_h(N, N, NNbrs);
// fill matrix
exch_matrix_t exchange_d(exchange_h); // should do the conversion and host-to-device copy for us
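
Putting the two snippets together, a rough self-contained sketch of the whole flow (N, NNbrs and the fill step are placeholders, not the actual vampire assembly code):

#include <cusp/array1d.h>
#include <cusp/csr_matrix.h>
#include <cusp/dia_matrix.h>
#include <cusp/multiply.h>

typedef cusp::dia_matrix<int, double, cusp::device_memory> exch_matrix_t;

void exchange_field_sketch(int N, int NNbrs)
{
   // Assemble on the host in CSR (row_offsets / column_indices / values),
   // which is basically what the serial code produces anyway.
   cusp::csr_matrix<int, double, cusp::host_memory> exchange_h(N, N, NNbrs);
   // ... fill exchange_h.row_offsets, exchange_h.column_indices, exchange_h.values ...

   // One-off conversion and host-to-device copy into the chosen device format.
   exch_matrix_t exchange_d(exchange_h);

   // fields = exchange * spins, all on the device.
   cusp::array1d<double, cusp::device_memory> spins(N, 1.0);
   cusp::array1d<double, cusp::device_memory> fields(N, 0.0);
   cusp::multiply(exchange_d, spins, fields);
}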

@odarbelaeze
Contributor Author

Hehe, I'm on it. I'll just amalgamate the library for now, as we did with CUB; we can clean up the mess before we merge to development 💯. Is it safe to do this:

cusp::array1d_view <...> spin_view (some_thrust_device_vector.begin(), some_thrust_device_vector.end());

@mattoaellis
Collaborator

Yeah, the cusp arrays are just slightly expanded thrust vectors, so they are all compatible. Annoyingly, thrust is (mostly) compatible with cusp arrays but not vice versa. Why the array view in this case? Since the array goes from begin() to end(), why not array1d?

@mattoaellis
Collaborator

Ah, I think I get it. Is it because the view does not copy the data but just references the underlying data?

@odarbelaeze
Contributor Author

Yeah, just to keep the spin arrays intact so that the changes do not conflict with other pieces of the code 😸

@mattoaellis
Collaborator

I've just tested that, and it seems to work fine creating an array view from a thrust device vector and passing that to cusp::multiply.
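
For reference, roughly what my test looks like (J_dia, spin_buffer and field_buffer here are stand-ins for the real device data):

#include <thrust/device_vector.h>
#include <cusp/array1d.h>
#include <cusp/dia_matrix.h>
#include <cusp/multiply.h>

typedef thrust::device_vector<double>::iterator thrust_iter;
typedef cusp::array1d_view<thrust_iter> cusp_view;

void exchange_spmv(cusp::dia_matrix<int, double, cusp::device_memory>& J_dia,
                   thrust::device_vector<double>& spin_buffer,
                   thrust::device_vector<double>& field_buffer)
{
   // The views only reference the thrust storage; nothing is copied.
   cusp_view spin_view(spin_buffer.begin(), spin_buffer.end());
   cusp_view field_view(field_buffer.begin(), field_buffer.end());

   // fields = J * spins, computed entirely on the device.
   cusp::multiply(J_dia, spin_view, field_view);
}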

@odarbelaeze
Contributor Author

I'm integrating it, but it seems like including the cusp headers conflicts with curand; did you run into that?

/usr/local/cuda/bin/..//include/curand_mtgp32_kernel.h(315): error: calling a __device__ function("__syncthreads") from a __host__ __device__ function("curand_mtgp32_single") is not allowed

/usr/local/cuda/bin/..//include/curand_mtgp32_kernel.h(373): error: calling a __device__ function("__syncthreads") from a __host__ __device__ function("curand_mtgp32_single_specific") is not allowed

/usr/local/cuda/bin/..//include/curand_kernel.h(392): warning: missing return statement at end of non-void function "__curand_uint32_as_float"

Those are not there if I omit the cusp headers 👎

@odarbelaeze
Contributor Author

odarbelaeze commented Aug 5, 2016

@mattoaellis
Collaborator

I've never had any problem with curand, cusp and thrust all together. As the post says, it is a version issue; I remember hearing some worrying reports about bugs like that. Can you drop back to CUDA 5 to test? Otherwise it may require changing the order in which the headers are included.

@mattoaellis
Collaborator

Ah, I do remember having some issues with CUSP at one point (not the same as yours) when I moved to a newer version, but they were just warnings rather than errors and it ran just fine. Check whether the version of CUSP you are including is the latest one. If not, you could download the library locally.

@odarbelaeze
Contributor Author

I downloaded version 0.4.0, which I think is the latest; I'll have to keep changing the order of the include statements.

@mattoaellis
Collaborator

I've tested out CUDA 5.5 and 6.5 using the latest version of CUSP, v0.5.1, and I don't seem to have any problems. Which CUSP and cuRAND headers are you using exactly?

@odarbelaeze
Contributor Author

odarbelaeze commented Aug 8, 2016

@mattoaellis is there a nice way to do the C = a*A*B + b*B thing we do with CuSPARSE in CUSP? Also see odarbelaeze@98cafe2#diff-ab8be042c5e3fb066cb0ea4e24a10990R15, which compiles on Wohlfarth.

@mattoaellis
Collaborator

@odarbelaeze CuRAND or CuSparse?

There is a more generalised multiply routine which allows you to specify details about the operation:
http://cusplibrary.github.io/group__matrix__algorithms.html#gafc7e7d80ae5f5f3fcd696f89074aa0b2

That is in CUSP v0.5.1, but I don't know if it is in v0.4. Mind you, for the operation you specify you can just fold b*B into the diagonals of A; I have done that in the past for computing the anisotropy in addition to the exchange interaction (see the sketch below).
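
A sketch of what I mean by the fold (since a*A*B + b*B = (a*A + b*I)*B, you can bake a and b into the matrix once at assembly time and a plain multiply is then enough; the names are illustrative and it assumes every row carries an explicit diagonal entry):

#include <cusp/csr_matrix.h>

// Rescale A in place so that it represents (a*A + b*I).
void fold_scalars(cusp::csr_matrix<int, double, cusp::host_memory>& A,
                  double a, double b)
{
   for (size_t row = 0; row < A.num_rows; ++row)
   {
      for (int k = A.row_offsets[row]; k < A.row_offsets[row + 1]; ++k)
      {
         A.values[k] *= a;                          // a*A
         if (A.column_indices[k] == (int) row)
            A.values[k] += b;                       // + b on the diagonal
      }
   }
}
// After converting A to the device format, a single cusp::multiply(A_d, B, C)
// now gives C = a*A*B + b*B.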

@odarbelaeze
Contributor Author

Now it runs and yields essentially the same results; however, it's considerably slower. Although all the data is already on the GPU, CUSP decides that it needs to move stuff to host memory and back 👎 while performing the field calculation, that is:

  • Step one is to create array views for the fields and spins.
  • Step two is to perform the multiplication.

generalized_spgemm does not seem to be added to the cusp namespace for some reason, so I ended up using generalized_spmv, which is equivalent for this purpose.

@richard-evans
Owner

Yuck - that's not what we wanted. So is the solution to build the DIA matrix by hand and then pass this to CuSPARSE? Or can cusp do the conversion once and store that, and then just pass this in as a DIA matrix?


@mattoaellis
Collaborator

That is rather strange.
Indeed, CUSP can do the conversion and then you can access the data however you want to perform the spmv elsewhere.
I've never had this issue, but I used the cusp arrays rather than the thrust ones, so I never needed the array views. Would it be worth swapping to the cusp arrays, since they can be used the same as thrust ones elsewhere?

@odarbelaeze
Contributor Author

That could be possible, since cusp handles the transformations easily. However, it seems the DIA format is not available in CuSPARSE (http://docs.nvidia.com/cuda/cusparse/#matrix-formats). We could experiment with the ELL format, or try to set up the arrays as cusp::array1d instead of thrust::device_vector, but that may require major changes to the whole vcuda package.

@odarbelaeze
Contributor Author

@mattoaellis does thrust::raw_pointer_cast work the same with cusp::array1d types?

@mattoaellis
Collaborator

Yep, that has worked in the past for me.
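
i.e. something along these lines (the kernel here is just a placeholder; the raw_pointer_cast idiom is the point):

#include <cusp/array1d.h>
#include <thrust/device_ptr.h>

// Placeholder kernel taking a plain device pointer.
__global__ void scale_kernel(double * data, int n, double factor)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) data[i] *= factor;
}

void launch_on_cusp_array(cusp::array1d<double, cusp::device_memory>& v)
{
   // Same idiom as with thrust::device_vector: &v[0] yields a device_ptr
   // and raw_pointer_cast strips it down to a plain double*.
   double * raw = thrust::raw_pointer_cast(&v[0]);
   int n = static_cast<int>(v.size());
   scale_kernel<<<(n + 255) / 256, 256>>>(raw, n, 2.0);
}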

@odarbelaeze
Contributor Author

@mattoaellis @richard-evans Good news: changing the types of the vectors to cusp::array1d and changing from C = alpha * A * B + beta * B to C = A * B sped up the code by a large amount. The only downside is that the exchange fields need to be computed first now, but it was the same for the non-exchange fields before. I'll test for correctness and have Andrea check for speed 👍

@richard-evans
Owner

Good news! Let's see how things turn out...


@odarbelaeze
Contributor Author

odarbelaeze commented Aug 8, 2016

I did a real quick Curie temperature benchmark:

# With CuSPARSE CSR
real    3m21.703s
user    2m16.917s
sys     1m3.840s

# With CUSP DIA
real    2m5.137s
user    1m30.466s
sys     0m34.178s

# With 1 serial core in Wohlfarth
real    19m29.640s
user    19m25.569s
sys     0m0.468s

I'll edit this with the CPU serial results when they're available. Furthermore, we are putting a lot of nice flags in the NVCC_FLAGS variable and then ignoring that variable in favour of CUDA_FLAGS 😅

@richard-evans
Owner

OK, nice! So ~5x faster for the exchange calculation. Yes, I deleted the NVCC variable in my PR. So maybe the next thing is to re-profile and see where the new hotspots are.

R


@odarbelaeze
Contributor Author

odarbelaeze commented Aug 8, 2016

I need better CPUs in Colombia 😞

# With CuSPARSE CSR
real    3m21.703s
user    2m16.917s
sys     1m3.840s

# With CUSP DIA
real    2m5.137s
user    1m30.466s
sys     0m34.178s

# With 1 serial core in Wohlfarth
real    19m29.640s
user    19m25.569s
sys     0m0.468s
