Commit
Merge branch 'main' of https://github.com/jafioti/luminal
jafioti committed Jul 9, 2024
2 parents fb544b7 + 9d37f25 commit 9aba341
Showing 1 changed file with 5 additions and 8 deletions.
13 changes: 5 additions & 8 deletions README.md
@@ -14,8 +14,8 @@ use luminal::prelude::*;

// Setup graph and tensors
let mut cx = Graph::new();
let a = cx.tensor().set([[1.0], [2.0], [3.0]]);
let b = cx.tensor().set([[1.0, 2.0, 3.0, 4.0]]);
let a = cx.tensor((3, 1)).set([[1.0], [2.0], [3.0]]);
let b = cx.tensor((1, 4)).set([[1.0, 2.0, 3.0, 4.0]]);

// Do math...
let mut c = a.matmul(b).retrieve();
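For context, here is a self-contained sketch of how the updated snippet is typically completed; the `compile`, `execute`, and `data` calls are taken from the rest of this README and the usual luminal examples, so treat the exact signatures as assumptions rather than part of this diff:

```rust
use luminal::prelude::*;

fn main() {
    // Set up the graph and tensors, using the shape-annotated API from this commit
    let mut cx = Graph::new();
    let a = cx.tensor((3, 1)).set([[1.0], [2.0], [3.0]]);
    let b = cx.tensor((1, 4)).set([[1.0, 2.0, 3.0, 4.0]]);

    // Record the matmul and mark its output to be kept after execution
    let mut c = a.matmul(b).retrieve();

    // Compile (generic graph simplification only) and run the graph
    cx.compile(GenericCompiler::default(), &mut c);
    cx.execute();

    // c is the 3x4 outer product of a and b
    println!("Result: {:?}", c.data());
}
```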
@@ -80,25 +80,22 @@ Now we can do:
- Devices and Dtypes are handled through compilers (just run the CUDA compiler to convert the graph to use CUDA kernels, then the fp16 compiler to convert to half-precision kernels)
- Networks can be written in generic code, but compiled and run fast on hyper-specific architectures (try writing a PyTorch network that works with both TF32 dtypes and TPUs; get ready for if-statement hell...). See the compiler-stacking sketch just below.
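
As a concrete, hedged illustration of the bullets above: compilers are stacked over the same graph to pick a backend and dtype. `GenericCompiler` is named later in this README; `CudaCompiler` and `Fp16Compiler` are hypothetical stand-ins for the "CUDA compiler" and "fp16 compiler" mentioned above, and passing a tuple of compilers is likewise an assumption about the API:

```rust
// Hedged sketch: choose device and precision purely by which compilers run.
// `CudaCompiler` and `Fp16Compiler` are HYPOTHETICAL names standing in for the
// "CUDA compiler" and "fp16 compiler" described above; only GenericCompiler is
// referenced elsewhere in this README.
cx.compile(
    (
        GenericCompiler::default(), // generic, device-agnostic simplifications
        CudaCompiler::default(),    // hypothetical: lower ops to CUDA kernels
        Fp16Compiler::default(),    // hypothetical: convert kernels to half precision
    ),
    &mut c, // keep the retrieved tensor's node id valid across rewrites
);
```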

### Compile-time Shape Checks
All operations are shape checked at compile time, so no more shape mismatches! Credit for this goes to [dfdx](https://github.com/coreylowman/dfdx).

### View the Graph
Once you've written all your computation code, run `cx.display()` to see the entire computation graph in all its glory. Pretty messy looking! Now run `cx.compile(GenericCompiler::default())` and display the graph again. Much better.
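
A short sketch of that workflow, continuing from the earlier example (this section shows a one-argument `compile` call; the extra `&mut c` remap argument here is an assumption so the retrieved tensor stays valid after the graph is rewritten):

```rust
// Inspect the graph before and after compilation to see the simplification.
cx.display();                                   // raw graph: every primitive op as its own node
cx.compile(GenericCompiler::default(), &mut c); // generic simplification passes
cx.display();                                   // same computation, far fewer nodes
```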

## Where are we?
- Metal and CUDA are supported for running models on Macs and Nvidia GPUs respectively, in both full and half precision.
- Performance on M-series Macs with LLMs is within 20% of llama.cpp (a *heavily* optimized library).
- Mistral 7B and Llama 8B are implemented in `examples/`. See instructions above for running.
- We have a small library of NN modules in `nn`, including transformers.
- Full training support with graph-based autograd.
- Llama 3, Phi 3, Whisper and Yolo v8 are implemented in `examples/`. See instructions above for running.
- We have a small library of NN modules in `luminal_nn`, including transformers.
- A significant number of high-level ops are implemented in `hl_ops`. We are aiming to match the most-used ~80% of the PyTorch API.
- The aim for 0.3 is to achieve SOTA performance on an M1 Pro (50 tok/s) and near-SOTA on single Nvidia GPUs (>100 tok/s), as well as support for many mainstream models (Whisper, Stable Diffusion, Yolo v9, etc.). See the tracking issue [here](https://github.com/jafioti/luminal/issues/29).

Some things on the roadmap:
- Optimize CUDA and Metal matmul kernels
- Fine-grained Metal and CUDA IR
- Build benchmarking suite to test against other libs
- Autograd engine
- Distributed data, pipeline, and tensor parallelism.
- Beat PT 2.0 perf on LLM training
- Write compiler for quantum photonic retro encabulator
