Issues and Questions about Execution of LLaMA using NNTrainer #2561
Comments
cibot: Thank you for posting issue #2561. The person in charge will reply soon. |
GPU support in NNTrainer is a work in progress. I expect LLMs to run on GPU around May~June. (e.g., #2535 / https://github.com/nnstreamer/nntrainer/pulls?q=is%3Apr+author%3As-debadri ) @s-debadri has been actively contributing GPU-related code. It is based on OpenCL because we target GPUs in embedded devices (mobile, TV, home appliances, ...), not servers with powerful A100/H100/B100 GPUs. Any GPU that supports OpenCL should work, though not as efficiently as CUDA on NVIDIA GPUs. |
Must-have metrics: peak memory consumption, first-token latency, and per-token latency after the first token is output (or "throughput"). |
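The three metrics above could be collected with a small timing harness around the generation loop. The following is a minimal sketch under stated assumptions, not NNTrainer code: `generate_next_token` is a hypothetical callback standing in for whatever call produces one token, `GenerationMetrics` and `measure_generation` are illustrative names, and peak memory is read through POSIX `getrusage` (Linux reports `ru_maxrss` in KiB, other systems may use different units).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <sys/resource.h> // POSIX getrusage

// Hypothetical container for the metrics named in the thread.
struct GenerationMetrics {
  double first_token_ms = 0.0; // time from start until the first token
  double per_token_ms = 0.0;   // mean time per token after the first
  double tokens_per_sec = 0.0; // throughput after the first token
  long peak_rss_kib = 0;       // peak resident set size (KiB on Linux)
  std::size_t tokens = 0;      // number of tokens actually generated
};

// Converts a steady_clock duration to milliseconds as a double.
static double to_ms(std::chrono::steady_clock::duration d) {
  return std::chrono::duration<double, std::milli>(d).count();
}

// generate_next_token is an assumed callback: it produces one token per call
// and returns false once generation has finished.
GenerationMetrics
measure_generation(const std::function<bool()> &generate_next_token,
                   std::size_t max_tokens) {
  using clock = std::chrono::steady_clock;
  assert(max_tokens > 0);

  GenerationMetrics m;
  const auto start = clock::now();
  auto first_done = start;
  auto last_done = start;

  for (std::size_t i = 0; i < max_tokens; ++i) {
    if (!generate_next_token())
      break;
    last_done = clock::now();
    if (m.tokens == 0)
      first_done = last_done; // timestamp of the first token
    ++m.tokens;
  }

  if (m.tokens > 0)
    m.first_token_ms = to_ms(first_done - start);
  if (m.tokens > 1) {
    const double rest_ms = to_ms(last_done - first_done);
    m.per_token_ms = rest_ms / static_cast<double>(m.tokens - 1);
    if (rest_ms > 0.0)
      m.tokens_per_sec = 1000.0 * static_cast<double>(m.tokens - 1) / rest_ms;
  }

  // Peak memory of the whole process so far, not just the generation loop.
  rusage usage{};
  if (getrusage(RUSAGE_SELF, &usage) == 0)
    m.peak_rss_kib = usage.ru_maxrss;
  return m;
}
```

Because `ru_maxrss` covers the whole process lifetime, it also includes model loading; that is usually what matters for an embedded target, but it should be reported as such.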
I wonder whether you changed the Hugging Face configuration for the 7B model. The current implementation is for the 1B model.
We will check and let you know. |
Thank you for the clarification. We have been using "meta-llama/Llama-2-7b-chat-hf", which is 7B. We plan to switch to "TinyLlama/TinyLlama-1.1B-Chat-v1.0". Is this the recommended model? If not, is there a recommended model for the LLaMA application? |
We will check the model, including TinyLlama. The current implementation is for tasks such as summarization and tone conversion. However, TinyLlama does not seem to have tokenizer compatibility with our implementation. Let us check, and we will let you know. |
The LLaMA application has been updated to address the issue; please pull the latest version and use it. |
Thank you for the update. We will look into the modified version. |
We were able to follow the instructions and build successfully. We now need the LLaMA model weight file "./llama_fp16.bin". Where can we find this file, or which model should we use to obtain the weights? |
Hi @myungjoo @jijoongmoon, I'm completely new to this framework, and I want to thank you for your contributions to open source.
|
@martinkorelic We have custom layers and nntrainer apps supporting LoRA and fine-tuning. We will be able to upstream the code soon when things are cleared (not sure when, but I'm pretty sure it may be within a few months). The model structures will be completely up to nntrainer users, and it will become easier especially after ONNX importer implementation. |
File path: nntrainer/Applications/LLaMA/jni/main.cpp. Add "#define ENABLE_ENCODER2" at the beginning.
File path: nntrainer/meson.build. Add "message ('platform: @0@'.format(get_option('platform')))" at line 28 and "message ('enable-fp16: @0@'.format(get_option('enable-fp16')))" at line 68.
File path: nntrainer/meson_options.txt. Set the fp16 option to true at line 39: "option('enable-fp16', type: 'boolean', value: true)"
We cannot find a correlation between the input and output sequences, so we would like to know how to interpret the results. When we set the locale, we encounter a segmentation fault, and we would like to know how to resolve it.
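One thing that may be worth ruling out for the locale problem, as an assumption rather than a confirmed cause: in standard C++, constructing a `std::locale` from a name that is not installed on the system throws `std::runtime_error`, and an uncaught exception terminates the program. The sketch below uses a hypothetical helper, `set_locale_or_fallback`, which is not part of NNTrainer; it tries the requested locale and falls back to the classic "C" locale if the name is unavailable.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from NNTrainer): installs `name` as the global
// locale if it is available, otherwise falls back to the classic "C" locale.
// Returns the name of the global locale in effect afterwards.
std::string set_locale_or_fallback(const std::string &name) {
  try {
    // Throws std::runtime_error if the named locale is not available.
    std::locale::global(std::locale(name));
  } catch (const std::runtime_error &) {
    std::locale::global(std::locale::classic());
  }
  // A default-constructed std::locale is a copy of the current global locale.
  return std::locale().name();
}
```

If the helper returns "C" for the locale name used in main.cpp, the locale is probably not installed on the machine; installed locales can be listed with `locale -a` on Linux.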
Do you have any recommendation for benchmarks to run to test results from LLaMA execution using NNTrainer?
We would also like to ask whether NNTrainer can run on a commercial off-the-shelf GPU. We currently have an NVIDIA A6000.
Progress update by Professor Hokeun Kim (https://github.com/hokeun) and his student Deeksha Prahlad (https://github.com/Deeksha-20-99)