[Feature] I can write code with large tensors, and the derivative runs as fast as PyTorch on GPU #43
A related transcript from datahaskell/lobby (https://matrix.to/#/!JSHTXUPHaIJhjEINXX:matrix.org?via=m.topoi.dev&via=gitter.im&via=matrix.org) follows:

anton_5: Staging would be really nice to have, but my understanding is that Haskell doesn't really have good support for staged metaprogramming right now. Anyway, I think that using ArrayFire would be a much quicker way to get your lib working with GPUs, so it could be a good starting point. Also, what is nice about ArrayFire is that at some point I managed to run it even on Intel's built-in GPU (in my laptop), so it's not limited to Nvidia cards. https://iagoleal.com/posts/calculus-symbolic-ode/

If going the staging route, I think it's worth exploring the possibility of emitting MLIR: https://mlir.llvm.org/. I have just found MLIR Haskell bindings! :) https://github.com/google/mlir-hs
https://github.com/arrayfire/arrayfire-haskell looks great, in particular because it implements the same vast set of operations on CPU (using various extensions) and GPU, transparently. Some caveats:

- The convolution operations are described in the language of signal processing, and it's not obvious whether they are performant enough for ML (e.g., a dimension is said to be flipped, which ML doesn't need; ML in fact uses correlation, but correlation is not available here).
- Functions are not shape-typed (and indeed they dynamically produce arrays of different sizes depending on complex conditions on the arguments, even where the sizes could in theory be derived statically).
- Arrays have at most four dimensions.
- Unlike hmatrix, it doesn't use storable vectors as the underlying representation, but requires a costly conversion (allocating a copy of the whole array) even for read-only use. I don't know whether the memory representation actually differs (I'd guess not) or whether that's just a design decision of these particular FFI bindings, cheaply ensuring that arrays are always copied even when used read-only. Or perhaps the internal representation differs per backend?

I think we could use this library instead of hmatrix, but we'd need to benchmark the two to make sure CPU performance is not much worse. Ideally, performance would remain acceptable even if we insist on converting to storable vectors on each gradient descent iteration. An option is to add more ranks (ArrayFire untyped and ArrayFire typed) and extend the optimizers to cope with the new representations. I wonder, are four dimensions enough?
Actually, the problem with the costly conversion to storable vectors is bigger than a once-per-gradient-descent-iteration slowdown. Every numerical operation on orthotope tensors is implemented by converting to hmatrix and back. That conversion is cheap, but only for vectors and matrices: hmatrix doesn't support 3 and 4 dimensions, so those are implemented by iterating over a vector of matrices, and there ArrayFire would have an advantage. Consequently, with ArrayFire each numerical operation would bear the extra conversion cost, which means we may need to implement a version of orthotope backed by ArrayFire arrays instead of storable vectors, in addition to extending/changing our set of implemented ranks. I'm not so keen on this, especially since the Haskell bindings to ArrayFire have had only one or two commits since 2019. Is the C API so stable? Is that because the C API is so comprehensive and well done? Or are the Haskell bindings behind the C API, and is that why correlation is not available and only the signal-processing-style convolution is? Checking which GHC versions it works with, and how quickly it can be ported, would be a good indication of the maintenance status of the Haskell bindings. Or a more direct route may be asking the kind maintainer @dmjio. :)

Now that I think about it, different array representations for different backends are necessary if the data is to stay on the GPU in between individual arithmetic operations. Sending data back to the CPU (so that the Storable Vector pointer can be created) after each matmul would be a performance disaster. To fake the Storable Vector representation at no cost, we'd need the sending to the CPU to be lazy, with a rewrite rule saying that if the data is sent back unchanged to the GPU, no transfer is performed at all. But IMHO it's more natural to just reimplement with ArrayFire the few orthotope operations that touch the underlying vector (most orthotope operations only tweak the dimensions list, by design). A sketch of the rewrite-rule idea follows below.
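To make the rewrite-rule idea concrete, here is a minimal sketch. GpuArray, toCpu and fromCpu are hypothetical names standing in for whatever transfer functions the bindings would provide; the NOINLINE pragmas give the rule a chance to fire before the stubs are inlined.

```haskell
-- Minimal sketch of the lazy-transfer idea; GpuArray, toCpu and fromCpu
-- are hypothetical stand-ins, not the arrayfire-haskell API.
import qualified Data.Vector.Storable as VS

data GpuArray a  -- opaque handle to device memory

toCpu :: VS.Storable a => GpuArray a -> VS.Vector a
toCpu = undefined  -- would download device memory into a storable vector
{-# NOINLINE toCpu #-}

fromCpu :: VS.Storable a => VS.Vector a -> GpuArray a
fromCpu = undefined  -- would upload the vector back to the device
{-# NOINLINE fromCpu #-}

-- If the data goes straight back to the GPU unchanged, skip both copies.
{-# RULES "fromCpu/toCpu" forall a. fromCpu (toCpu a) = a #-}
```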
Hello @Mikolaj 👋🏼 😄

Regarding the Haskell ArrayFire bindings: since reading your comment I've updated the bindings to support the latest ArrayFire release.

Regarding the convolve function (among others) I believe you're referring to, I will regenerate the low-level bindings to incorporate these new functions and make a release.

The nix build isn't well supported any longer, but installing ArrayFire by other means works.

More than happy to continue to support the bindings.
@dmjio, this is amazing news. If we add Haskell ArrayFire to the pool of our backends, I will certainly badger you with requests and, let's hope, also patches. And I'm already grateful that you're giving us the option. BTW, do you happen to know why the underlying data container in Haskell ArrayFire is an abstract pointer instead of a Storable Vector (with a primitive Haskell Storable array under the hood), as in hmatrix, orthotope and some others? Also, any long-term plans for a dependently typed, sized (shaped), very high-level API, similar to orthotope?
My understanding is that the operations exposed via the C API first construct an AST that gets JIT-compiled into code that minimizes the number of GPU round trips and memory allocations. So one perspective is that the ArrayFire API lets you construct thunks that are lazily, and efficiently, evaluated on the target backend (GPU or CPU). @9prady9 and the team can go into more detail, I'm sure. These resources helped me understand more of the lazy evaluation infrastructure.
But I'm pretty sure this is the reason for the abstract type.
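For intuition, here is a toy model of the evaluation strategy described above, using none of the real ArrayFire API: operations only build an expression graph, and the whole graph is evaluated in one pass when a result is finally demanded.

```haskell
-- Toy model of lazy operator fusion: Add/Mul build a graph for free;
-- 'force' plays the role of ArrayFire's JIT boundary, evaluating the
-- whole accumulated graph in one go.
data Expr
  = Lit [Double]
  | Add Expr Expr
  | Mul Expr Expr

square :: Expr -> Expr
square x = Mul x x  -- cheap: no array data is touched yet

force :: Expr -> [Double]
force (Lit xs)  = xs
force (Add a b) = zipWith (+) (force a) (force b)
force (Mul a b) = zipWith (*) (force a) (force b)
```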
Definitely open to it. I know hmatrix has a statically typed interface.
I gathered some more feedback on ML for GPU. Staging (JIT) to MLIR comes up repeatedly, due to how costly GPU-CPU communication is and how cool MLIR is. However, for a start, I'd prefer not to handle staging myself but to outsource it. It seems ArrayFire does just that, though probably not with MLIR (it predates MLIR?). So let's try out ArrayFire and benchmark it against hmatrix: #71.

@dmjio, thank you for the explanation and the links. So it seems the conversion of the abstract pointer to a concrete RAM array is precisely the cue for ArrayFire to JIT and compute. If so, it may be reasonable (and perhaps optimal) to JIT once per total gradient calculation in the gradient descent loop. More often than that increases the communication overhead; less often keeps too much intermediate data in memory. In other words, converting the ArrayFire pointer to a RAM array whenever the main gradient descent state is updated, which I worried about previously, may indeed incur an unnecessary transfer from the GPU, but at least it prompts the JIT at the right moment. A sketch of this loop shape follows below.

The only substantial problem that remains is porting orthotope to an arbitrary underlying tensor representation (#72) instead of assuming a RAM array (Storable Vector). That's because triggering the JIT after each single operation on tensors would be a performance disaster. But the port is not needed for ArrayFire vs hmatrix benchmarks, which can be done without type-level shape checking.
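Here is a sketch of what "JIT once per gradient step" could look like in the gradient descent loop. GpuArray, toCpu and fromCpu are the same hypothetical stand-ins as in the earlier sketch, and grad stands for whatever function the AD machinery produces; none of this is a real API.

```haskell
-- Hypothetical gradient-descent loop: the single toCpu per iteration is
-- where the whole gradient graph would be JIT-compiled and executed.
import qualified Data.Vector.Storable as VS

data GpuArray a  -- opaque device handle

toCpu :: VS.Storable a => GpuArray a -> VS.Vector a
toCpu = undefined  -- download from the device

fromCpu :: VS.Storable a => VS.Vector a -> GpuArray a
fromCpu = undefined  -- upload to the device

descend
  :: (GpuArray Double -> GpuArray Double)  -- gradient of the objective
  -> Double                                -- learning rate
  -> Int                                   -- number of iterations
  -> VS.Vector Double                      -- initial parameters, on the CPU
  -> VS.Vector Double
descend grad lr = go
 where
  go 0 params = params
  go k params =
    let g = toCpu (grad (fromCpu params))  -- one forced evaluation per step
        params' = VS.zipWith (\p d -> p - lr * d) params g
    in  go (k - 1) params'
```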
I don't have experience with that one, but I don't see the use of https://hackage.haskell.org/package/ghc-typelits-natnormalise in the typed hmatrix code examples. From my experience with orthotope, without the plugin, type-checking any larger code is a nightmare, and the version that finally type-checks is terribly bloated (basically carrying lots of repeated trivial arithmetic constraints, e.g., reassociations of type-level sums); see the sketch below.

BTW, is the restriction to 4 dimensions something deep-seated or is it easy to lift? I never used more dimensions, but it's a mismatch vs the orthotope API and also one more thing to warn users about.
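For illustration, a small example of the kind of code that needs the plugin: without ghc-typelits-natnormalise, GHC cannot solve even the trivial equality 0 + m ~ m, so the append below fails to type-check unless explicit proofs are threaded through by hand.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeOperators #-}
{-# OPTIONS_GHC -fplugin GHC.TypeLits.Normalise #-}
import GHC.TypeLits

data Vec (n :: Nat) a where
  Nil  :: Vec 0 a
  Cons :: a -> Vec n a -> Vec (n + 1) a

-- Requires 0 + m ~ m and (p + 1) + m ~ (p + m) + 1: trivial arithmetic
-- that GHC alone cannot see, but the plugin discharges silently.
append :: Vec n a -> Vec m a -> Vec (n + m) a
append Nil         ys = ys
append (Cons x xs) ys = Cons x (append xs ys)
```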
I believe this to be true.
If GHC's type checker turns out to be too slow, this sounds ideal. Also, it might be best to newtype Array with type-level dimension information and re-export typed variant modules for all current modules. Just a thought; a sketch follows below.
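A possible shape of that newtype, as a sketch: Array stands in for the untyped arrayfire-haskell array, and matmulUntyped is a hypothetical placeholder for the untyped operation.

```haskell
{-# LANGUAGE DataKinds, KindSignatures #-}
-- Phantom type-level shape over the untyped array: only shape-correct
-- calls type-check, while the untyped operation does the work inside.
import GHC.TypeLits (Nat)

data Array a  -- placeholder for the untyped ArrayFire array type

newtype TypedArray (dims :: [Nat]) a = TypedArray (Array a)

matMul :: TypedArray '[m, k] Double
       -> TypedArray '[k, n] Double
       -> TypedArray '[m, n] Double
matMul (TypedArray x) (TypedArray y) = TypedArray (matmulUntyped x y)

matmulUntyped :: Array a -> Array a -> Array a
matmulUntyped = undefined  -- would call the untyped ArrayFire matmul
```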
It seems limited to 4 dimensions currently. Seeing benchmarks against hmatrix would be exciting 😎
Note to self regarding ArrayFire (and this may be totally wrong): it seems the API does not contain anything like a build or map operation that takes a Haskell function to compute array elements. While googling for that, I noticed that the Haskell CUDA bindings (https://hackage.haskell.org/package/cuda) seem to be even more low-level, so that's probably a common problem (and, frankly, what was I thinking; how would ArrayFire or CUDA understand Haskell closures?). OTOH, Julia has bindings both for CUDA and for ArrayFire, and I wonder if those contain such operations. A sketch of the CPU-side workaround follows below.
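For concreteness, a sketch of the CPU-side workaround: since the GPU cannot run Haskell closures, build the array with a Haskell function on the CPU and pay for a single upload. GpuArray and fromVector are hypothetical names, not confirmed arrayfire-haskell API.

```haskell
-- Build an array element-by-element on the CPU, then transfer once.
import qualified Data.Vector.Storable as VS

data GpuArray a  -- opaque device handle, as in the sketches above

fromVector :: VS.Storable a => VS.Vector a -> GpuArray a
fromVector = undefined  -- would upload the vector to the device

buildOnCpu :: Int -> (Int -> Double) -> GpuArray Double
buildOnCpu n f = fromVector (VS.generate n f)
```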
Just a slight note on array construction within ArrayFire. Tangentially, there are ways to save an Array for later reuse. But yeah, custom Array construction (via a Haskell function) isn't something the API currently offers.
It's worth investigating whether https://github.com/twesterhout/halide-haskell permits generation of CUDA or OpenCL code (for the user to, e.g., compile for some exotic embedded device); if so, it may well be a more convenient alternative to MLIR.
Desiderata:
Supposedly, this can be done rather cheaply by offloading most of the work to LLVM and the Haskell packages that interface with it.
My loose notes from a chat with Alp Mestanogullari:
you can always write instances for ASTs and JIT-compile algorithms built on top of your differentiation mechanics: https://github.com/llvm-hs/llvm-hs-examples/blob/master/arith/Arith.hs

it's essentially scalar functions of just one (scalar) variable, and you can keep in mind that this could just be taken down the GPU path instead with llvm-ptx. the above example should look fairly simple to you, despite your lack of familiarity with llvm-hs.

https://github.com/llvm-hs/llvm-hs-examples/blob/master/arith/Arith.hs#L266 <- this is me building the LLVM AST/module for my little expression type. the rest is just plumbing.

so yeah, if you want to take performance up a notch, I'd explore going down that route. you can still use/refer to external functions (BLAS and all), as long as you link to them, but this gives you a chance to optimize your expressions and emit optimal machine code for your computations, like a bunch of modern stuff does.

and imagine staging this whole process (i.e. doing some TH to run the codegen and link the machine code in with your Haskell executable) so that none of that compilation happens at runtime... pretty ideal world here.

https://github.com/google-research/dex-lang#dex- <- the paper for this was very interesting btw, if you like thinking about ASTs/DSLs/compilation/numerics/AD/etc.
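A footnote on "writing instances for ASTs": any code polymorphic over Num can be run at an expression type to recover a syntax tree, which is what would then be handed to llvm-hs (as in the Arith.hs example above). A minimal sketch:

```haskell
-- Overloading recovers the syntax tree of numeric code for free; the
-- tree is what would then be compiled via llvm-hs (or llvm-ptx for GPU).
data Expr
  = Var String
  | Lit Double
  | Add Expr Expr
  | Mul Expr Expr
  deriving Show

instance Num Expr where
  (+)         = Add
  (*)         = Mul
  negate e    = Mul (Lit (-1)) e
  fromInteger = Lit . fromInteger
  abs         = error "abs: not needed for this sketch"
  signum      = error "signum: not needed for this sketch"

-- Any Num-polymorphic function doubles as an AST builder:
poly :: Num a => a -> a
poly x = x * x + 2 * x + 1

-- > poly (Var "x")
-- Add (Add (Mul (Var "x") (Var "x")) (Mul (Lit 2.0) (Var "x"))) (Lit 1.0)
```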