Tentative clarification around supported LLM models #147
francoishernandez started this conversation in Show and tell
-
Hi François, I'm keen to test fine-tuning Llama 2 or 3.1 with my bilingual datasets. In the recipe provided for wmt22_with_TowerInstruct-llama2 I don't see a yaml config file to train from the Llama model. Can you explain briefly how I can go about fine-tuning an LLM with my bilingual datasets?
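As a rough illustration only (not a tested recipe — the field names below are assumptions and should be checked against the eole docs and the existing recipe configs), a fine-tuning config could look along these lines:

```yaml
# Hypothetical sketch of a fine-tuning config for a converted Llama
# checkpoint. All field names and values are illustrative assumptions,
# not a verified eole recipe.
model_path: ./llama3.1-8b-eole    # converted HF checkpoint (assumed layout)
data:
    bilingual:
        path_src: data/train.src  # your source-language file
        path_tgt: data/train.tgt  # your target-language file
training:
    train_steps: 4000
    learning_rate: 2e-5
```

The general pattern in the existing recipes is: convert the HF checkpoint to the eole format first, then point a training config at the converted model and your parallel data.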
-
Context
LLMs have been moving fast. While we’ve been striving to stay ahead with various implementations along the way, it’s not always clear what models are fully supported.
To address this, I’m kickstarting a small framework to clarify and track model support. This initiative is a starting point for better transparency and collaboration within the eole community.
This is also a good starting point for potential contributors, as it outlines what kinds of adaptations would be welcome additions for supporting new models.
Also, keeping track of some relatively simple benchmarks gives us a replicable setup that helps catch breaking changes early. This matters especially for eole, whose unified architecture shares core components across various models.
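As a sketch of what such a replicable check could look like (the model names, scores, and tolerance below are made-up placeholders, not eole's actual test suite):

```python
# Hypothetical sketch of a benchmark regression gate: a fresh run is
# compared against a stored reference score. Numbers are placeholders,
# not real eole results.

REFERENCE_SCORES = {
    "llama3.1-8b": 66.0,  # placeholder MMLU reference
    "llama2-7b": 45.0,    # placeholder MMLU reference
}

def within_tolerance(model: str, new_score: float, tolerance: float = 1.0) -> bool:
    """Return True if a fresh benchmark run stays within `tolerance`
    points of the stored reference, i.e. no suspected breaking change."""
    return abs(new_score - REFERENCE_SCORES[model]) <= tolerance
```

Run after any change touching the shared components; a failure points at a potential regression in the unified code path rather than at a single model.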
So, what’s the plan?
Disclaimer
These results were initially obtained on the more_tokenizers branch (Supporting HF tokenizers #122) with a few tweaks. I'm currently running on the main branch to identify potential discrepancies.
Observations
Most results are within a reasonable margin of the reference results. Slight differences are probably not alarming, considering the low absolute reliability of the MMLU benchmark.
That being said, there are a few very noticeable gaps between the reference MMLU scores (mostly taken from HF model cards) and our numbers.
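To spot those gaps systematically, one could rank models by the difference between our numbers and the model-card references and flag anything beyond a chosen margin (names and values below are illustrative, not real results):

```python
# Illustrative sketch: rank MMLU gaps vs. reference scores and flag the
# ones beyond a margin. Scores are placeholders, not real eole numbers.

def flag_gaps(ours: dict, reference: dict, margin: float = 2.0) -> list:
    """Return (model, gap) pairs sorted by absolute gap, keeping only
    models whose score deviates from the reference by more than `margin`."""
    gaps = {m: ours[m] - reference[m] for m in ours if m in reference}
    flagged = [(m, g) for m, g in gaps.items() if abs(g) > margin]
    return sorted(flagged, key=lambda mg: abs(mg[1]), reverse=True)
```

Models that survive the margin filter are the ones worth investigating first, since small deviations are expected given MMLU's low absolute reliability.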
Tracking
Experiments will be tracked here, as well as in the dedicated GitHub Project.
This project is a draft, and might evolve. For instance, we could create some model-support repo to create clean issues and keep track of various model-related topics. Also, it might move out of GitHub Projects for better extensibility. We'll see how this evolves.
There are a few structural fields, notably:
Next steps