
CUDA Debug/Performance TODOS #12

Open · 4 of 9 tasks
odarbelaeze opened this issue Jul 29, 2016 · 31 comments

@odarbelaeze
Contributor

odarbelaeze commented Jul 29, 2016

  • Verify that the materials with zero atoms are properly handled in CUDA stats.
  • Limit the cu::grid_size variable at initialisation (according to the plan)
  • Add const and __restrict__ qualifiers to read-only arrays in kernel calls for better caching (see the sketch after this list)
  • Add support for tensorial exchange
  • Move to cusp for vector-matrix multiplication with DIA format
  • Check register usage in Heun scheme - may not be too bad
  • CuSPARSE calls are asynchronous - overlap with spin fields at the cost of extra storage? Also need to check for synchronisation before Heun integration
  • Remove a bunch of intermediate variables from external field calculation - might help performance
  • The discrimination towards energy as a stat in GPU needs to stop #stopTheDiscrimination

That's a start; let's see what else comes up and record it here.
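
For the restrict/const item, something like this is what I have in mind (the kernel and array names here are made up, not the actual vampire kernels):

// Marking read-only kernel arguments as const ... __restrict__ lets the
// compiler assume no aliasing and route the loads through the read-only cache.
__global__ void update_fields(
      const double * __restrict__ sx,  // read-only spin components
      const double * __restrict__ sy,
      const double * __restrict__ sz,
      double * hx,                     // output fields
      double * hy,
      double * hz,
      int n_atoms)
{
   for (int i = blockIdx.x * blockDim.x + threadIdx.x;
        i < n_atoms;
        i += gridDim.x * blockDim.x)
   {
      // placeholder body - the real kernels accumulate the field terms here
      hx[i] = sx[i];
      hy[i] = sy[i];
      hz[i] = sz[i];
   }
}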

@richard-evans
Owner

richard-evans commented Jul 30, 2016

  • Try a symmetric matrix calculation to improve sparse matrix performance, but this requires data preconditioning.
  • Fix a few type casts to speed up SP performance, e.g. sqrtf, 1.0f, etc. (see the sketch below).
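
Roughly the kind of change meant in the second point (the function itself is made up; the sqrt vs sqrtf and 1.0 vs 1.0f distinction is what matters):

// Double-precision literals and maths calls silently promote the whole
// expression to double, which is much slower on most GPUs.
__device__ float inv_norm_slow(float x, float y, float z)
{
   return 1.0 / sqrt(x * x + y * y + z * z);    // 1.0 and sqrt() are double
}

__device__ float inv_norm_fast(float x, float y, float z)
{
   return 1.0f / sqrtf(x * x + y * y + z * z);  // stays in single precision
   // (rsqrtf(...) would also avoid the division entirely)
}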

odarbelaeze changed the title from CUDA Debug TODOS to CUDA Debug/Performance TODOS on Aug 5, 2016
@odarbelaeze
Contributor Author

@richard-evans according to this:

Jxx_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[0]);
Jyy_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[1]);
Jzz_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[2]);

I can guess the xx, yy, and zz values of the exchange for the vector case, but what would be the indices for xy, xz, etc. for the tensor case?

@mattoaellis
Collaborator

mattoaellis commented Aug 5, 2016

IIRC there are separate exchange lists for the isotropic, vector and tensor cases. So for xy, xz, ... you probably want something like:
Jxy_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[0][1]);

I would need to check the src files though.
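
Following that pattern, my guess at the full set for the tensorial case would be something like this (assuming Jij is the 3x3 tensor in t_exchange_list; the Jxy_vals_h etc. containers are hypothetical and I haven't checked this against the source):

// diagonal components
Jxx_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[0][0]);
Jyy_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[1][1]);
Jzz_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[2][2]);
// off-diagonal components
Jxy_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[0][1]);
Jxz_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[0][2]);
Jyx_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[1][0]);
Jyz_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[1][2]);
Jzx_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[2][0]);
Jzy_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[2][1]);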

@odarbelaeze
Contributor Author

@mattoaellis since you're here, did you try CUSP for the matrix multiplication? I'm looking at this at the moment: http://cusplibrary.github.io/classcusp_1_1csr__matrix__view.html#a2a5a8f5d37b23c657002ad4c31509688

@mattoaellis
Collaborator

Yeah, I've used CUSP in the past for matrix multiplication. It's pretty neat for setting things up and for easy calculations. It has both the diagonal (DIA) and ELLPACK (ELL) formats, which are both useful for the exchange calculation.
I thought we decided against it because it isn't part of the main CUDA toolkit and would require people to download the library themselves?
Another downside is that (unless they have updated it) they don't have any symmetric solvers, since they claim the transpose would not be efficient enough.

@odarbelaeze
Contributor Author

Hmm, it seems to be available on the clusters, so no extra downloads required. Furthermore, DIA seems to be the way to go, and it's very easy to do the format conversions with this.

@mattoaellis
Collaborator

mattoaellis commented Aug 5, 2016

I think that was because we were using it before. Out of interest, what version is it? I would certainly recommend it, and downloading an extra library is not that hard, but I don't think we can count on it being pre-installed on every cluster.
DIA is likely going to be the best, but not always, as it heavily depends on the numerical ordering of the spins. ELL is a bit more versatile in that respect, as it only depends on having a constant number of values per row, which we are likely to have unless it is some strange exchange. I suggest setting it up as a typedef within a macro so that we can easily switch between them if we go for CUSP:

#ifdef __DIAEXCH
typedef cusp::dia_matrix<int, double, cusp::device_memory> exch_matrix_t;
#else
typedef cusp::ell_matrix<int, double, cusp::device_memory> exch_matrix_t;
#endif

In CUSP it is super easy to set up the matrix on the host in a simple format, i.e. CSR, which I think the serial version uses anyway, and then just convert to our chosen format.

cusp::csr_matrix<int, double, cusp::host_memory> exchange_h(N, N, NNbrs);
// fill matrix
exch_matrix_t exchange_d(exchange_h); // should do the conversion and host-to-device copy for us
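
Putting the two snippets together, a rough self-contained sketch of the whole flow (N, NNbrs and the fill step are placeholders, not the actual vampire assembly code):

#include <cusp/array1d.h>
#include <cusp/csr_matrix.h>
#include <cusp/dia_matrix.h>
#include <cusp/multiply.h>

typedef cusp::dia_matrix<int, double, cusp::device_memory> exch_matrix_t;

void exchange_field_sketch(int N, int NNbrs)
{
   // Assemble on the host in CSR (row_offsets / column_indices / values),
   // which is basically what the serial code produces anyway.
   cusp::csr_matrix<int, double, cusp::host_memory> exchange_h(N, N, NNbrs);
   // ... fill exchange_h.row_offsets, exchange_h.column_indices, exchange_h.values ...

   // One-off conversion and host-to-device copy into the chosen device format.
   exch_matrix_t exchange_d(exchange_h);

   // fields = exchange * spins, all on the device.
   cusp::array1d<double, cusp::device_memory> spins(N, 1.0);
   cusp::array1d<double, cusp::device_memory> fields(N, 0.0);
   cusp::multiply(exchange_d, spins, fields);
}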

@odarbelaeze
Contributor Author

Hehe, I'm on it. I'll just amalgamate the library for now, as we did with CUB; we can clean up the mess before we merge to development 💯. Is it safe to do this:

cusp::array1d_view <...> spin_view (some_thrust_device_vector.begin(), some_thrust_device_vector.end());

@mattoaellis
Collaborator

Yeah, the cusp arrays are just slightly expanded thrust vectors, so they are all compatible. Annoyingly, thrust is (mostly) compatible with cusp arrays but not vice versa. Why the array view in this case? Since the array goes from begin() to end(), why not array1d?

@mattoaellis
Collaborator

Ah, I think I get it. Is it because the view does not copy the data but just references the underlying data?

@odarbelaeze
Contributor Author

Yeah, just to keep the spin arrays intact so that the changes do not conflict with other pieces of the code 😸

@mattoaellis
Collaborator

I've just tested that, and it seems to work fine creating an array view from a thrust device vector and passing that to cusp::multiply.
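
For reference, roughly what my test looks like (J_dia, spin_buffer and field_buffer here are stand-ins for the real device data):

#include <thrust/device_vector.h>
#include <cusp/array1d.h>
#include <cusp/dia_matrix.h>
#include <cusp/multiply.h>

typedef thrust::device_vector<double>::iterator thrust_iter;
typedef cusp::array1d_view<thrust_iter> cusp_view;

void exchange_spmv(cusp::dia_matrix<int, double, cusp::device_memory>& J_dia,
                   thrust::device_vector<double>& spin_buffer,
                   thrust::device_vector<double>& field_buffer)
{
   // The views only reference the thrust storage; nothing is copied.
   cusp_view spin_view(spin_buffer.begin(), spin_buffer.end());
   cusp_view field_view(field_buffer.begin(), field_buffer.end());

   // fields = J * spins, computed entirely on the device.
   cusp::multiply(J_dia, spin_view, field_view);
}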

@odarbelaeze
Contributor Author

I'm integrating it, but it seems like including the cusp headers conflicts with curand; did you run into that?

/usr/local/cuda/bin/..//include/curand_mtgp32_kernel.h(315): error: calling a __device__ function("__syncthreads") from a __host__ __device__ function("curand_mtgp32_single") is not allowed

/usr/local/cuda/bin/..//include/curand_mtgp32_kernel.h(373): error: calling a __device__ function("__syncthreads") from a __host__ __device__ function("curand_mtgp32_single_specific") is not allowed

/usr/local/cuda/bin/..//include/curand_kernel.h(392): warning: missing return statement at end of non-void function "__curand_uint32_as_float"

Those are not there if I omit the cusp headers 👎

@odarbelaeze
Contributor Author

odarbelaeze commented Aug 5, 2016

@mattoaellis
Collaborator

I've never had any problem with curand, cusp and thrust all together. As the post says, it is a version issue; I remember hearing some worrying reports about bugs like that. Can you drop back to CUDA 5 to test? Otherwise it may require changing the order in which the headers are included.

@mattoaellis
Collaborator

Ah, I do remember having some issues with CUSP at one point (not the same as yours) when I moved to a newer version, but they were just warnings rather than errors and it ran just fine. Check whether the version of CUSP you are including is the latest one. If not, you could download the library locally.

@odarbelaeze
Contributor Author

I downloaded version 0.4.0, which I think is the latest; I'll have to keep changing the order of the include statements.

@mattoaellis
Collaborator

I've tested out CUDA 5.5 and 6.5 using the latest version of CUSP, v0.5.1, and I don't seem to have any problems. Which CUSP and cuRAND headers are you using exactly?

@odarbelaeze
Contributor Author

odarbelaeze commented Aug 8, 2016

@mattoaellis is there a nice way to do the C = a*A*B + b*B thing we do with CuSPARSE in CUSP? Also see odarbelaeze@98cafe2#diff-ab8be042c5e3fb066cb0ea4e24a10990R15, which compiles on Wohlfarth.

@mattoaellis
Collaborator

@odarbelaeze CuRAND or CuSparse?

There is a more generalised multiply routine which allows you to specify details about the operation:
http://cusplibrary.github.io/group__matrix__algorithms.html#gafc7e7d80ae5f5f3fcd696f89074aa0b2

That is in CUSP v0.5.1, but I don't know if it is in v0.4. Mind you, for the operation you specify you can just fold b*B into the diagonals of A; I have done that in the past for computing the anisotropy in addition to the exchange interaction (see the sketch below).
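
A sketch of what I mean by the fold (since a*A*B + b*B = (a*A + b*I)*B, you can bake a and b into the matrix once at assembly time and a plain multiply is then enough; the names are illustrative and it assumes every row carries an explicit diagonal entry):

#include <cusp/csr_matrix.h>

// Rescale A in place so that it represents (a*A + b*I).
void fold_scalars(cusp::csr_matrix<int, double, cusp::host_memory>& A,
                  double a, double b)
{
   for (size_t row = 0; row < A.num_rows; ++row)
   {
      for (int k = A.row_offsets[row]; k < A.row_offsets[row + 1]; ++k)
      {
         A.values[k] *= a;                          // a*A
         if (A.column_indices[k] == (int) row)
            A.values[k] += b;                       // + b on the diagonal
      }
   }
}
// After converting A to the device format, a single cusp::multiply(A_d, B, C)
// now gives C = a*A*B + b*B.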

@odarbelaeze
Contributor Author

Now it runs and yields essentially the same results; however, it's considerably slower. Although all the data is already on the GPU, CUSP decides that it needs to move stuff to host memory and back 👎 while performing the field calculation, that is:

  • Step one is to create array views for the fields and spins.
  • Step two is to perform the multiplication.

generalized_spgemm does not seem to be added to the cusp namespace for some reason, so I ended up using generalized_spmv, which is equivalent for this purpose.

@richard-evans
Owner

Yuck - that's not what we wanted. So is the solution to build the DIA matrix by hand and then pass this to CuSPARSE? Or can cusp do the conversion once and store that, and then just pass this in as a DIA matrix?


@mattoaellis
Collaborator

That is rather strange.
Indeed, CUSP can do the conversion and then you can access the data however you want to perform the spmv elsewhere.
I've never had this issue, but I used the cusp arrays rather than the thrust ones, so I never needed the array views. Would it be worth swapping to the cusp arrays, since they can be used the same as thrust ones elsewhere?

@odarbelaeze
Contributor Author

That could be possible, since cusp handles the transformations easily. However, it seems the DIA format is not available in CuSPARSE (http://docs.nvidia.com/cuda/cusparse/#matrix-formats). We could experiment with the ELL format, or try to set up the arrays as cusp::array1d instead of thrust::device_vector, but that may require major changes to the whole vcuda package.

@odarbelaeze
Contributor Author

@mattoaellis does thrust::raw_pointer_cast work the same with cusp::array1d types?

@mattoaellis
Collaborator

Yep, that has worked in the past for me.
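
i.e. something along these lines (the kernel here is just a placeholder; the raw_pointer_cast idiom is the point):

#include <cusp/array1d.h>
#include <thrust/device_ptr.h>

// Placeholder kernel taking a plain device pointer.
__global__ void scale_kernel(double * data, int n, double factor)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) data[i] *= factor;
}

void launch_on_cusp_array(cusp::array1d<double, cusp::device_memory>& v)
{
   // Same idiom as with thrust::device_vector: &v[0] yields a device_ptr
   // and raw_pointer_cast strips it down to a plain double*.
   double * raw = thrust::raw_pointer_cast(&v[0]);
   int n = static_cast<int>(v.size());
   scale_kernel<<<(n + 255) / 256, 256>>>(raw, n, 2.0);
}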

@odarbelaeze
Contributor Author

@mattoaellis @richard-evans Good news: changing the types of the vectors to cusp::array1d and changing from C = alpha * A * B + beta * B to C = A * B sped up the code by a large amount. The only downside is that the exchange fields need to be computed first now, but it was the same for the non-exchange fields before. I'll test for correctness and have Andrea check for speed 👍

@richard-evans
Owner

Good news! Let's see how things turn out...


@odarbelaeze
Contributor Author

odarbelaeze commented Aug 8, 2016

I did a real quick Curie temperature benchmark:

# With CuSPARSE CSR
real    3m21.703s
user    2m16.917s
sys     1m3.840s

# With CUSP DIA
real    2m5.137s
user    1m30.466s
sys     0m34.178s

# With 1 serial core in Wohlfarth
real    19m29.640s
user    19m25.569s
sys     0m0.468s

I'll edit this with the CPU serial results when they're available. Furthermore, we are putting a lot of nice flags in the NVCC_FLAGS variable and then ignoring that variable in favour of CUDA_FLAGS 😅

@richard-evans
Owner

OK, nice! So ~5x faster for the exchange calculation. Yes, I deleted the NVCC variable in my PR. So maybe the next thing is to re-profile and see where the new hotspots are.

R


@odarbelaeze
Contributor Author

odarbelaeze commented Aug 8, 2016

I need better CPUs in Colombia 😞

# With CuSPARSE CSR
real    3m21.703s
user    2m16.917s
sys     1m3.840s

# With CUSP DIA
real    2m5.137s
user    1m30.466s
sys     0m34.178s

# With 1 serial core in Wohlfarth
real    19m29.640s
user    19m25.569s
sys     0m0.468s
