Implemented BLEU score, wrote unit tests and documentation for it. #1006

Merged 7 commits on Jan 17, 2025

Changes from 4 commits
42 changes: 6 additions & 36 deletions apps/opik-documentation/documentation/docs/cookbook/dspy.ipynb
@@ -37,17 +37,9 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"OPIK: Opik is already configured. You can check the settings by viewing the config file at /Users/jacquesverre/.opik.config\n"
]
}
],
"outputs": [],
"source": [
"import opik\n",
"\n",
@@ -56,7 +48,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -78,7 +70,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -95,31 +87,9 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:langfuse:Langfuse client is disabled since no public_key was provided as a parameter or environment variable 'LANGFUSE_PUBLIC_KEY'. See our docs: https://langfuse.com/docs/sdk/python/low-level-sdk#initialize-client\n",
"OPIK: Started logging traces to the \"DSPY\" project at https://www.comet.com/opik/jacques-comet/redirect/projects?name=DSPY.\n"
]
},
{
"data": {
"text/plain": [
"Prediction(\n",
" reasoning='The meaning of life is a philosophical question that has been contemplated by humans for centuries. Different cultures, religions, and individuals have proposed various interpretations. Some suggest that the meaning of life is to seek happiness, fulfillment, and personal growth, while others believe it is about serving a higher purpose or contributing to the well-being of others. Ultimately, the meaning of life may vary from person to person, shaped by personal experiences, beliefs, and values.',\n",
" answer=\"The meaning of life is subjective and can vary greatly among individuals. It may involve seeking happiness, personal growth, and contributing to the well-being of others, or fulfilling a higher purpose, depending on one's beliefs and experiences.\"\n",
")"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"cot = dspy.ChainOfThought(\"question -> answer\")\n",
"cot(question=\"What is the meaning of life?\")"
@@ -16,6 +16,7 @@ You can use the following heuristic metrics:
| RegexMatch | Checks if the output matches a specified regular expression pattern |
| IsJson | Checks if the output is a valid JSON object |
| Levenshtein | Calculates the Levenshtein distance between the output and an expected string |
| BLEU | Calculates the BLEU score for output text against one or more reference texts |

## Score an LLM response

@@ -97,3 +98,64 @@ metric = LevenshteinRatio()
score = metric.score(output="Hello world !", reference="hello")
print(score)
```

### BLEU

The BLEU metric measures how closely the LLM output matches one or more reference texts. A single metric class covers both modes:
- Single-sentence BLEU: pass a single output string and one or more reference strings.
- Corpus-level BLEU: pass a list of output strings and a parallel list of references, where each entry is either a single reference string or a list of alternative references.
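
For background, BLEU combines modified n-gram precisions with a brevity penalty that penalizes outputs shorter than the references. The standard Papineni et al. (2002) formulation, which this metric is assumed to follow, is:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

where `p_n` is the modified precision for n-grams of order `n`, `w_n` are the n-gram weights (uniform by default), `c` is the candidate length, and `r` is the effective reference length.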

#### Single-sentence BLEU

```python
from opik.evaluation.metrics import BLEU

bleu_metric = BLEU()

score = bleu_metric.score(
output="Hello world!",
reference="Hello world"
)
print(score.value, score.reason)

score = bleu_metric.score(
output="Hello world!",
reference=["Hello planet", "Hello world"]
)
print(score.value, score.reason)
```

#### Corpus-level BLEU

```python
from opik.evaluation.metrics import BLEU

bleu_metric = BLEU()

outputs = ["Hello there", "This is a test."]
references = [
["Hello world", "Hello there"],
"This is a test."
]

result = bleu_metric.score(output=outputs, reference=references)
print(result.value, result.reason)
```

You can also customize the maximum n-gram order, the smoothing method, and the n-gram weights:

```python
from opik.evaluation.metrics import BLEU

metric = BLEU(
n_grams=4,
smoothing_method="method1",
weights=[0.25, 0.25, 0.25, 0.25]
)

score = metric.score(
output="The cat sat on the mat",
reference=["The cat is on the mat", "A cat sat here on the mat"]
)
print(score.value, score.reason)
```
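
If you want to sanity-check the numbers, a comparable score can be computed with NLTK's reference implementation; the `method1` smoothing name above mirrors NLTK's `SmoothingFunction` methods, but the tokenization used by this metric is an assumption here, so treat this as an illustrative cross-check rather than part of the Opik API:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Whitespace tokenization is assumed; the Opik metric may tokenize differently.
candidate = "The cat sat on the mat".split()
references = [
    "The cat is on the mat".split(),
    "A cat sat here on the mat".split(),
]

# Uniform 4-gram weights and method1 smoothing, matching the example above.
score = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(score)
```
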
2 changes: 2 additions & 0 deletions sdks/python/src/opik/evaluation/metrics/__init__.py
@@ -3,6 +3,7 @@
from .heuristics.is_json import IsJson
from .heuristics.levenshtein_ratio import LevenshteinRatio
from .heuristics.regex_match import RegexMatch
from .heuristics.bleu import BLEU
from .llm_judges.answer_relevance.metric import AnswerRelevance
from .llm_judges.context_precision.metric import ContextPrecision
from .llm_judges.context_recall.metric import ContextRecall
@@ -29,4 +30,5 @@
"RegexMatch",
"MetricComputationError",
"BaseMetric",
"BLEU",
]
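
With this re-export in place, the metric can be imported directly from `opik.evaluation.metrics`. A minimal sanity check, assuming only the module paths shown in this diff and the `score()` interface documented above:

```python
from opik.evaluation.metrics import BLEU
from opik.evaluation.metrics.heuristics.bleu import BLEU as HeuristicBLEU

# The top-level name should re-export the heuristics implementation added here.
assert BLEU is HeuristicBLEU

# score() returns a result object with value and reason, per the docs above.
result = BLEU().score(output="Hello world", reference="Hello world")
print(result.value, result.reason)
```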