- This project is a quick-and-dirty tool for evaluating the ability of some large language models (LLMs) to carry out tasks via interaction in the Breton language.
- So far, only 2 tasks are implemented:
- br2fr (Breton to French translation)
- fr2br (French to Breton translation)
- The evaluation produces a proximity score measuring the semantic distance between the text produced by an LLM and a reference text pre-written by a human evaluator.
- The semantic distance is based on the proximity of OpenAI embeddings (see the sketch below this list).
- The LLMs made available by OpenRouter can be tested. You can test a model explicitly by adding its model name (e.g. openai/gpt-4-turbo-preview), or test them all by using the alias model name openrouter/all.
- Although Google Translate is not strictly an LLM, it can be tested with the name google-translate. Note that Google Translate cannot be tested via OpenRouter.
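As an illustration of the scoring, here is a minimal sketch of computing a proximity score from OpenAI embeddings. The embedding model name and the cosine-similarity metric are assumptions chosen for illustration, not necessarily the exact choices made in translate_and_eval.py.

```python
# Minimal sketch: proximity of two texts via OpenAI embeddings.
# Assumptions: the embedding model and the cosine-similarity metric
# are illustrative; translate_and_eval.py may use different choices.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(text: str, model: str = "text-embedding-3-small") -> np.ndarray:
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)


def proximity_score(candidate: str, reference: str) -> float:
    a, b = embed(candidate), embed(reference)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


print(proximity_score("Le chien est assez grand.", "Le chien est plutôt grand."))
```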
- Ubuntu OS
- A mandatory OPENROUTER_API_KEY (cf. https://openrouter.ai/keys), as OpenRouter is used as an intermediary to reach all LLMs except Google Translate.
- An optional GOOGLE_TRANSLATION_PROJECT_ID (cf. https://console.cloud.google.com/, assign a project to Cloud Translation API), which is needed if the Google Translate "LLM" is to be evaluated.
- A mandatory OPENAI_API_KEY (cf. https://platform.openai.com/api-keys), as an OpenAI model is used to calculate the evaluation scores.
- A mandatory source file of your choice (e.g. samples_br.txt)
- An optional target file of your choice (e.g. samples_fr.txt). If not provided, evaluation will not be performed.
- An optional glossary file of your choice (e.g. samples_gloss.txt). If provided, the content will be sent to the LLM(s) to help them perform the translation.
- A dedicated configuration file (e.g. samples_br.yaml)
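The configuration file ties these inputs together. The sketch below shows one plausible way such a file could be read; the key names (source_file, target_file, glossary_file) are hypothetical, and the actual schema expected by translate_and_eval.py may differ (PyYAML assumed available).

```python
# Hypothetical reader for samples_br.yaml -- the key names below are
# illustrative guesses, not the repository's actual schema.
import yaml  # PyYAML, assumed available

with open("samples_br.yaml") as f:
    cfg = yaml.safe_load(f)

source_file = cfg["source_file"]          # mandatory, e.g. samples_br.txt
target_file = cfg.get("target_file")      # optional, e.g. samples_fr.txt
glossary_file = cfg.get("glossary_file")  # optional, e.g. samples_gloss.txt
```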
- git clone https://github.com/marxav/score_br_model.git
- cd score_br_model
- python3 -m venv env
- source env/bin/activate
- pip install openai pandas ipykernel tabulate google-generativeai anthropic groq mistralai cohere jupyter google-cloud-translate
- echo OPENAI_API_KEY=your-secret-key-1 >> .env
- echo OPENROUTER_API_KEY=your-secret-key-2 >> .env
- echo GOOGLE_TRANSLATION_PROJECT_ID=your-secret-gt-project-id >> .env
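The keys stored in .env must then reach the process environment. A minimal, dependency-free loader is sketched below; whether translate_and_eval.py uses this approach or a package such as python-dotenv is an assumption.

```python
# Minimal .env loader: one KEY=value pair per line.
# Assumption: the repository may instead rely on python-dotenv;
# this sketch only illustrates the idea.
import os


def load_env_file(path: str = ".env") -> None:
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())


load_env_file()
print("OPENROUTER_API_KEY" in os.environ)
```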
- To test google-translate:
- log in to https://console.cloud.google.com/
- Enable Cloud Translation API
- go to https://cloud.google.com/sdk/docs/install and install the gcloud CLI on your machine
- and then run:
- gcloud init
- gcloud auth application-default login
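Once gcloud authentication is in place, the Cloud Translation API can be reached from Python through the google-cloud-translate package installed above. The v3 client shown below is one standard way to do it; the exact call made by translate_and_eval.py is an assumption.

```python
# Minimal sketch of a fr2br request to the Cloud Translation API (v3).
# Assumption: translate_and_eval.py may structure its call differently.
import os
from google.cloud import translate_v3

client = translate_v3.TranslationServiceClient()
parent = f"projects/{os.environ['GOOGLE_TRANSLATION_PROJECT_ID']}/locations/global"

response = client.translate_text(
    request={
        "parent": parent,
        "contents": ["Le chien est assez grand."],
        "mime_type": "text/plain",
        "source_language_code": "fr",
        "target_language_code": "br",
    }
)
for translation in response.translations:
    print(translation.translated_text)
```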
- cd score_br_model
- source env/bin/activate
- python translate_and_eval.py samples_br.yaml
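Under the hood, every model except google-translate is reached through OpenRouter, which exposes an OpenAI-compatible endpoint. A minimal sketch of such a call follows; the prompt handling inside translate_and_eval.py is an assumption.

```python
# Minimal sketch of a br2fr request through OpenRouter.
# OpenRouter is OpenAI-compatible, so the openai client works with a
# different base_url; the prompt details here are illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="openai/gpt-4-turbo-preview",
    temperature=0.0,
    messages=[
        {
            "role": "user",
            "content": "Translate the following Breton text to French. "
            "Immediately write the translated text, nothing more.\n\n"
            "C'hoant am eus da ganañ.",
        }
    ],
)
print(response.choices[0].message.content)
```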
- The results file of a br2fr task, as configured in samples_br.yaml, will be similar to the following (s_rank and p_rank are the model's ranks by score and by price, respectively):
task | model | score | s_rank | price | p_rank |
---|---|---|---|---|---|
br2fr | openai/gpt-4-turbo-preview | 0.97 ± 0.05 | 1 | 0.00378 | 136 |
br2fr | openai/gpt-4-vision-preview | 0.97 ± 0.05 | 2 | 0.00378 | 137 |
br2fr | openai/gpt-4-1106-preview | 0.97 ± 0.05 | 3 | 0.00384 | 138 |
br2fr | openai/gpt-4o:extended | 0.95 ± 0.07 | 6 | 0.001968 | 122 |
br2fr | anthropic/claude-3-haiku | 0.95 ± 0.07 | 4 | 0.00017025 | 79 |
br2fr | anthropic/claude-3-haiku:beta | 0.95 ± 0.07 | 5 | 0.00017025 | 80 |
br2fr | anthropic/claude-3-opus | 0.95 ± 0.06 | 7 | 0.01029 | 140 |
br2fr | anthropic/claude-3-opus:beta | 0.95 ± 0.07 | 8 | 0.01029 | 141 |
br2fr | google/palm-2-chat-bison-32k | 0.94 ± 0.1 | 9 | 0.000304 | 93 |
br2fr | openai/gpt-4o-2024-05-13 | 0.93 ± 0.08 | 10 | 0.00161 | 120 |
br2fr | openai/gpt-4-turbo | 0.93 ± 0.08 | 11 | 0.00366 | 135 |
br2fr | openai/chatgpt-4o-latest | 0.92 ± 0.07 | 13 | 0.0017 | 121 |
br2fr | openai/gpt-4o-2024-08-06 | 0.92 ± 0.1 | 12 | 0.00094 | 114 |
br2fr | anthropic/claude-3.5-sonnet | 0.92 ± 0.17 | 14 | 0.002073 | 124 |
br2fr | anthropic/claude-3.5-sonnet:beta | 0.92 ± 0.17 | 15 | 0.002073 | 125 |
br2fr | openai/o1-preview-2024-09-12 | 0.91 ± 0.1 | 16 | 0.117255 | 144 |
br2fr | google/gemini-pro-1.5-exp | 0.9 ± 0.14 | 17 | 0 | 1 |
br2fr | google/gemini-flash-1.5-exp | 0.89 ± 0.1 | 18 | 0 | 2 |
br2fr | nousresearch/hermes-3-llama-3.1-405b:free | 0.89 ± 0.15 | 19 | 0 | 3 |
br2fr | perplexity/llama-3.1-sonar-huge-128k-online | 0.89 ± 0.14 | 20 | 0.001095 | 116 |
br2fr | anthropic/claude-3-sonnet | 0.89 ± 0.16 | 21 | 0.002073 | 126 |
br2fr | anthropic/claude-3-sonnet:beta | 0.89 ± 0.16 | 22 | 0.002073 | 127 |
br2fr | google/gemini-pro-vision | 0.88 ± 0.12 | 23 | 0.0001775 | 83 |
... |
- The results file of a fr2br task, as configured in samples_fr.yaml, will be similar to:
task | model | score | s_rank | price | p_rank |
---|---|---|---|---|---|
fr2br | openai/gpt-4-0314 | 0.77 ± 0.19 | 1 | 0.00984 | 137 |
fr2br | openai/gpt-4-32k-0314 | 0.77 ± 0.19 | 2 | 0.0198 | 141 |
fr2br | perplexity/llama-3.1-sonar-huge-128k-online | 0.75 ± 0.14 | 3 | 0.00106 | 111 |
fr2br | openai/o1-preview-2024-09-12 | 0.74 ± 0.19 | 4 | 0.125085 | 143 |
fr2br | google/gemini-pro-1.5-exp | 0.73 ± 0.17 | 5 | 0 | 1 |
fr2br | openai/gpt-4-1106-preview | 0.73 ± 0.18 | 6 | 0.00416 | 133 |
fr2br | google/gemini-flash-1.5-exp | 0.72 ± 0.17 | 7 | 0 | 2 |
fr2br | openai/gpt-4o-2024-08-06 | 0.72 ± 0.15 | 8 | 0.001285 | 113 |
fr2br | anthropic/claude-3.5-sonnet | 0.72 ± 0.16 | 9 | 0.002079 | 122 |
fr2br | anthropic/claude-3.5-sonnet:beta | 0.72 ± 0.16 | 10 | 0.002079 | 123 |
fr2br | openai/gpt-4-turbo | 0.72 ± 0.18 | 11 | 0.00422 | 134 |
fr2br | openai/gpt-4-vision-preview | 0.72 ± 0.19 | 12 | 0.00422 | 135 |
fr2br | google-translate | 0.7 ± 0.16 | 13 | 0 | 3 |
fr2br | google/palm-2-chat-bison-32k | 0.7 ± 0.19 | 14 | 0.000332 | 91 |
fr2br | anthropic/claude-3-sonnet | 0.69 ± 0.16 | 15 | 0.002019 | 120 |
fr2br | openai/gpt-4-turbo-preview | 0.69 ± 0.15 | 16 | 0.00404 | 132 |
fr2br | meta-llama/llama-3.1-405b-instruct:free | 0.68 ± 0.18 | 17 | 0 | 4 |
fr2br | anthropic/claude-3-sonnet:beta | 0.68 ± 0.16 | 18 | 0.002019 | 121 |
fr2br | perplexity/llama-3.1-sonar-large-128k-chat | 0.67 ± 0.13 | 19 | 0.000243 | 84 |
fr2br | anthropic/claude-3-opus | 0.67 ± 0.14 | 20 | 0.010395 | 138 |
fr2br | nousresearch/hermes-3-llama-3.1-405b:free | 0.66 ± 0.17 | 21 | 0 | 5 |
fr2br | nousresearch/hermes-3-llama-3.1-405b:extended | 0.66 ± 0.17 | 22 | 0 | 6 |
fr2br | anthropic/claude-2.0 | 0.66 ± 0.12 | 23 | 0.003784 | 127 |
... |
- The results presented above serve as a preliminary illustration of the assessment software's functionality. However, they do not constitute a comprehensive evaluation. A rigorous assessment would require the incorporation of reference data across various language registers, utilizing a significantly larger dataset than what is provided in samples_br.txt and samples_fr.txt. Additionally, the target file (e.g., samples_fr.txt) must remain confidential, as its publication online could lead to its eventual assimilation by language models.
- The source text to be translated must be in a *.txt file (e.g. samples_br.txt).
- In order to evaluate the translation, another file must contain the target translation (e.g. samples_fr.txt), against which the LLM output will be compared.
- Running translate_and_eval.py creates two files:
- A log file containing all translations and scores;
- For example: samples_br_logs.tsv
- A result file containing the summary of scores.
- For example: samples_br_res.tsv
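Both files are tab-separated, so they are easy to inspect programmatically, assuming the column names match the result tables shown above:

```python
# Load the summary of scores and show the ten best-ranked models.
# Assumption: the TSV columns match the result tables shown above.
import pandas as pd

results = pd.read_csv("samples_br_res.tsv", sep="\t")
print(results.sort_values("s_rank").head(10).to_markdown(index=False))
```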
- Enhance the scoring metric(s)
- Add more samples in samples.tsv
- Add a leaderboard of the tested LLMs and their scores at the different tasks
- Either like the LMSYS leaderboard
- Or via a product like https://scale.com/leaderboard
- Some models can refuse to translate some sentences that they consider to fall into categories such as:
- HARM_CATEGORY_SEXUALLY_EXPLICIT,
- HARM_CATEGORY_HATE_SPEECH,
- HARM_CATEGORY_HARASSMENT,
- HARM_CATEGORY_DANGEROUS_CONTENT.
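These category names come from Google's generative AI safety settings. When calling a Gemini model directly with the google-generativeai package installed above, the blocking thresholds can be relaxed as sketched below; whether translate_and_eval.py configures safety settings this way is an assumption, and the API key is hypothetical.

```python
# Minimal sketch: relaxing Gemini safety thresholds so that fewer
# translations are refused. Assumption: translate_and_eval.py may or
# may not configure safety settings this way.
import google.generativeai as genai
from google.generativeai.types import HarmBlockThreshold, HarmCategory

genai.configure(api_key="your-google-api-key")  # hypothetical key
model = genai.GenerativeModel(
    "gemini-pro",
    safety_settings={
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    },
)
print(model.generate_content("Translate to French: Demat.").text)
```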
- Instead of using this tool, you can manually use LMSYS (in the "Arena side-by-side" tab) to compare the results of two models
- In the parameters, set temperature=0.0 and top_p=0.95
- For the br2fr task, input a prompt like:
- Translate the following Breton text to French. Immediately write the translated text, nothing more. Do not add any personal comment beyond translation, just translate. The translated text must contain the same number of sentences and same number of '.' characters as in the input text.\n\nC'hoant am eus da ganañ. Ar wirionez zo gantañ. Ar c'hi zo bras awalc'h. Na vezit ket e gortoz e rofen ar respontoù deoc’h. Echu eo.
- For the fr2br task, input a prompt like:
- Translate the following French text to Breton. Immediately write the translated text, nothing more. Do not add any personal comment beyond translation, just translate. The translated text must contain the same number of sentences and same number of '.' characters as in the input text.\n\nJ'ai envie de chanter. Il a raison. Le chien est assez grand. Ne vous attendez pas à ce que je vous donne les réponses. C'est fini.
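The constraint on the number of sentences and '.' characters in these prompts can be checked mechanically; a minimal sketch of such a check follows (whether the tool enforces it this way is an assumption):

```python
# Minimal sketch: check that a translation preserves the number of
# '.' characters, as required by the prompts above. Assumption: the
# actual validation in translate_and_eval.py may differ.
def same_dot_count(source: str, translation: str) -> bool:
    return source.count(".") == translation.count(".")


src = "C'hoant am eus da ganañ. Ar wirionez zo gantañ. Echu eo."
out = "J'ai envie de chanter. Il a raison. C'est fini."
print(same_dot_count(src, out))  # True: both contain three '.'
```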
- tregor_2110_br.txt is a sample of a text written by Gireg Konan (Le Tregor newspaper, n°2110, June 6th 2024).
- assimil_76_br.txt is a sample of a text written by Fañch Morvannou (Le Breton sans peine Assimil book, 1990).
- maria_prat_br.txt is a sample of a text written by Maria Prat (Eun toullad kontadennou, Skol Uhel ar Vro, 1988).
- pipi_gonto_br.txt is a sample of a text written by Dir na dor (Pipi Gonto, E. Kemper, 1926).