Issues and Questions about Execution of LLaMA using NNTrainer #2561

Open
Deeksha-20-99 opened this issue Apr 30, 2024 · 13 comments

@Deeksha-20-99

  • We ran the LLaMA model (downloaded from Hugging Face: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) with NNTrainer and obtained the outputs shown below. These are the steps we followed:
  1. Changes made before running the LLaMA model:
    - In nntrainer/Applications/LLaMA/PyTorch, run llama_weights_converter.py to generate the "./llama_fp16.bin" file from the Hugging Face LLaMA model, and save the file in the nntrainer/jni directory.
    - In nntrainer/Applications/LLaMA/jni/main.cpp, add "#define ENABLE_ENCODER2" at the beginning of the file.
    - In nntrainer/meson.build, add "message ('platform: @0@'.format(get_option('platform')))" at line 28 and "message ('enable-fp16: @0@'.format(get_option('enable-fp16')))" at line 68.
    - In nntrainer/meson_options.txt, set the fp16 option to true at line 39: "option('enable-fp16', type: 'boolean', value: true)".
  2. Run "meson build" and "ninja -C build" in the NNTrainer directory.
  3. Enter the jni directory inside NNTrainer and run "../build/Applications/LLaMA/jni/nntrainer_llama".
  • With the locale set via "std::locale::global(std::locale("ko_KR.UTF-8"));", we got the following output:
[Screenshot: Korean locale]
  • With the locale set via "std::locale::global(std::locale("en_US.UTF-8"));", we got the following output:
[Screenshot 2024-04-29 at 4 26 29 PM]
  • With the locale statement commented out, we got this output:
[Screenshot 2024-04-29 at 4 30 13 PM]
  • We cannot find any correlation between the input and output sequences, so we would like to know how to interpret the results. When we set the locale we encounter a segmentation fault, and we would like to know how to resolve it.

  • Do you have any recommendation for benchmarks to run to test results from LLaMA execution using NNTrainer?

  • We also wanted to ask if we could run NNTrainer on a commercial off-the-shelf GPU. We currently have the NVIDIA A 6000.
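
One thing worth ruling out when experimenting with the locale statements above: the std::locale constructor throws std::runtime_error if the named locale is not installed on the machine. The following is a minimal sketch of a guarded locale setup. The helper name set_first_available_locale is hypothetical and is not part of NNTrainer, and nothing in this thread confirms that a missing locale caused the crash reported here.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not NNTrainer code): try each candidate locale name in
// order and install the first one the system actually provides. Returns the
// name that was installed, or "C" if none of the candidates is available.
std::string set_first_available_locale(
    const std::vector<std::string> &candidates) {
  for (const std::string &name : candidates) {
    try {
      // std::locale's constructor throws std::runtime_error when the named
      // locale is not installed on the system.
      std::locale::global(std::locale(name));
      return name;
    } catch (const std::runtime_error &) {
      // Not available: fall through and try the next candidate.
    }
  }
  // Fall back to the classic "C" locale, which always exists.
  std::locale::global(std::locale::classic());
  return "C";
}
```

A call such as set_first_available_locale({"ko_KR.UTF-8", "en_US.UTF-8"}) would then report which locale was actually installed instead of terminating on an uncaught exception.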

Progress update by - Professor Hokeun Kim (https://github.com/hokeun) and his student Deeksha Prahlad (https://github.com/Deeksha-20-99)

@taos-ci

taos-ci commented Apr 30, 2024

:octocat: cibot: Thank you for posting issue #2561. The person in charge will reply soon.

@myungjoo
Member

myungjoo commented Apr 30, 2024

> We also wanted to ask if we could run NNTrainer on a commercial off-the-shelf GPU. We currently have the NVIDIA A 6000.

GPU support in NNTrainer is a work in progress. I expect LLMs to be running on GPU around May to June (e.g., #2535 / https://github.com/nnstreamer/nntrainer/pulls?q=is%3Apr+author%3As-debadri ). @s-debadri has been actively contributing GPU-related code.

This is based on OpenCL because we target GPUs of embedded devices (mobile, TV, home appliances, ...), not servers with powerful A100/H100/B100.

Any GPU that supports OpenCL should work; however, on NVIDIA GPUs it will not be as efficient as CUDA.

@myungjoo
Member

myungjoo commented Apr 30, 2024

> Do you have any recommendation for benchmarks to run to test results from LLaMA execution using NNTrainer?

Must-have metrics: peak memory consumption, first-token latency, and per-token latency after the first output token (or "throughput").
Good-to-have metrics: energy consumption (J) per given number of input tokens, throughput under given power (W) and thermal budgets, compute resource (CPU, GPU) utilization statistics, and average and peak memory (DRAM) traffic. These additional metrics give an idea of how the model would behave on actual user devices: battery consumption, throttled performance due to temperature, performance when other apps are running, and so on.
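
The two latency metrics above can be sketched with std::chrono. This is a rough illustration only: generate_next_token is a hypothetical callback standing in for one decoding step of whatever model is being measured, and the sketch assumes that the first call also covers prompt processing (prefill). Peak memory and energy are not covered here and would need OS-level or hardware tooling.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Timing summary for one generation run.
struct LatencyStats {
  double first_token_ms = 0.0;     // time until the first token is produced
  double per_token_ms = 0.0;       // mean time per token after the first
  double tokens_per_second = 0.0;  // throughput after the first token
};

// Calls generate_next_token() num_tokens times and records the latencies.
// generate_next_token is a hypothetical stand-in for one decoding step; the
// first call is assumed to include prompt processing (prefill).
LatencyStats measure_latency(const std::function<void()> &generate_next_token,
                             std::size_t num_tokens) {
  using Clock = std::chrono::steady_clock;  // monotonic, safe for intervals
  using Ms = std::chrono::duration<double, std::milli>;
  assert(num_tokens > 0);

  LatencyStats stats;
  const auto start = Clock::now();
  generate_next_token();
  const auto first_done = Clock::now();
  stats.first_token_ms = Ms(first_done - start).count();

  for (std::size_t i = 1; i < num_tokens; ++i) {
    generate_next_token();
  }
  const auto end = Clock::now();

  if (num_tokens > 1) {
    const double rest_ms = Ms(end - first_done).count();
    const double rest_tokens = static_cast<double>(num_tokens - 1);
    stats.per_token_ms = rest_ms / rest_tokens;
    if (rest_ms > 0.0) {
      stats.tokens_per_second = 1000.0 * rest_tokens / rest_ms;
    }
  }
  return stats;
}
```

Reporting first-token latency separately from per-token latency matters because the first token includes the cost of processing the whole prompt, while later tokens reflect steady-state decoding.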

@myungjoo
Member

> Here we are not able to find the correlation between the input and output sequence, hence we wanted to check the way we can infer the results. With setting the locale we are encountering the segmentation error and wanted to know what could be done to resolve this.

@lhs8928 @baek2sm ?

@Deeksha-20-99
Author

We would like to thank the team for fixing the issue through the commit. We were able to get past the segmentation fault and run the LLaMA model. We got the output shown in the images below, but we still cannot understand what is printed.
[Screenshot 2024-04-30 at 5 36 46 PM]
[Screenshot 2024-04-30 at 5 39 11 PM]

@jijoongmoon
Collaborator

jijoongmoon commented May 1, 2024

I wonder whether you changed the configuration for the 7B model from Hugging Face. The current implementation is for the 1B model.
Do you want to use Applications/LLaMA as a kind of chatbot? If so, I think it needs some fixes as well. As you can see in the code, it just takes the prefill context and generates the output once. For a chatbot-style task, we need some kind of iteration (it is not difficult, though) that keeps the KV cache alive.
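
The iteration described above can be illustrated with a toy sketch. Everything in it is hypothetical: ToyKVCache, toy_step, and chat_turn are made-up names, the "model" just returns token + 1, and none of this reflects the actual NNTrainer API. It only shows the control flow, where the cache outlives a single turn so that each new turn prefills only its own prompt.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a KV cache: one entry per token already processed.
// A real cache would hold per-layer key/value tensors, not token ids.
struct ToyKVCache {
  std::vector<int> tokens;
};

// Hypothetical decoding step: appends `token` to the cache and returns a
// dummy "next token". A real model would run attention over the cache here.
int toy_step(ToyKVCache &cache, int token) {
  cache.tokens.push_back(token);
  return token + 1;
}

// One chat turn. Only the new prompt tokens are prefilled, because earlier
// turns are still held in `cache`. Returns the tokens generated this turn.
std::vector<int> chat_turn(ToyKVCache &cache, const std::vector<int> &prompt,
                           std::size_t max_new_tokens) {
  assert(!prompt.empty());
  int next = 0;
  for (int token : prompt) {
    next = toy_step(cache, token);  // prefill the new prompt tokens
  }
  std::vector<int> generated;
  for (std::size_t i = 0; i < max_new_tokens; ++i) {
    generated.push_back(next);      // emit the predicted token
    next = toy_step(cache, next);   // feed it back in and extend the cache
  }
  return generated;
}
```

Calling chat_turn repeatedly with the same ToyKVCache object is the "iteration" in question: the cache grows across turns instead of being rebuilt from scratch for every prompt.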

> Here we are not able to find the correlation between the input and output sequence, hence we wanted to check the way we can infer the results. With setting the locale we are encountering the segmentation error and wanted to know what could be done to resolve this.

We will check and let you know.

@Deeksha-20-99
Author

Thank you for the clarification. We have been using "meta-llama/Llama-2-7b-chat-hf", which is a 7B model. We plan to switch to "TinyLlama/TinyLlama-1.1B-Chat-v1.0". Is this the recommended model? If not, is there a recommended model for the LLaMA application?

@jijoongmoon
Collaborator

jijoongmoon commented May 2, 2024

We will check the models, including TinyLlama. The current implementation is meant for tasks such as summarization, tone conversion, etc. However, TinyLlama does not seem to have tokenizer compatibility with our implementation. Let us check and we will let you know.

@baek2sm
Contributor

baek2sm commented Jul 17, 2024

The LLaMA application has been updated to address this issue; please pull the latest version and use it.
When you build, you need to set the "enable_encoder" option to true in meson_options.txt in order to use the tokenizer.
Then run the application with the vocab.json and merges.txt files in the same directory as the executable.
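
Since the tokenizer files need to sit next to the executable, a small pre-flight check can make a missing file obvious before the model starts. The sketch below uses std::filesystem (C++17). The helper name missing_files is hypothetical and is not part of NNTrainer; only the file names vocab.json and merges.txt come from the instructions above.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <vector>

// Hypothetical helper (not NNTrainer code): returns the names from `required`
// that do not exist as regular files inside `dir`.
std::vector<std::string> missing_files(
    const std::filesystem::path &dir,
    const std::vector<std::string> &required) {
  std::vector<std::string> missing;
  for (const std::string &name : required) {
    if (!std::filesystem::is_regular_file(dir / name)) {
      missing.push_back(name);
    }
  }
  return missing;
}
```

For example, missing_files(exe_dir, {"vocab.json", "merges.txt"}) returns an empty vector when both files are present in exe_dir, where exe_dir is whatever directory holds the built executable.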

@Deeksha-20-99
Author

Thank you for the update. We will look into the modified version.

@Deeksha-20-99
Author

We followed the instructions and were able to build. We would now like to obtain the LLaMA model weights, i.e. the weight file "./llama_fp16.bin". Where can we find this file, or which model should we use to generate the weights?

@martinkorelic

Hi @myungjoo @jijoongmoon, I'm completely new to this framework, and I want to thank you for your contributions to open source.
My questions are regarding LLMs in this framework:

  • Is fine-tuning such a pretrained model already supported with the given example?
  • Will more model structures be added, or is it possible to start creating custom ones and loading weights into them?
  • Is it possible to implement parameter-efficient fine-tuning such as LoRA with them?

@myungjoo
Member

@martinkorelic We have custom layers and nntrainer apps supporting LoRA and fine-tuning. We will be able to upstream the code once things are cleared (I'm not sure exactly when, but I expect it to be within a few months). The model structures will be completely up to nntrainer users, and defining them will become easier, especially once the ONNX importer is implemented.
