Jacques/evaluate prompt #1023

Merged
merged 28 commits into from
Jan 14, 2025

Changes from 26 commits
28 commits
60986c5
WIP
jverre Jan 8, 2025
911c01b
WIP
jverre Jan 10, 2025
e7c3cc2
WIP
jverre Jan 10, 2025
13626dc
Update evaluation
jverre Jan 12, 2025
8cafa1c
Update for linters
jverre Jan 12, 2025
79b5f90
Update testing of code blocks
jverre Jan 12, 2025
49f27a3
Update testing of code blocks
jverre Jan 12, 2025
a3d7279
Update testing of code blocks
jverre Jan 12, 2025
e9ded5b
Update github actions
jverre Jan 12, 2025
1e251f7
Fix codeblocks
jverre Jan 12, 2025
b507c0d
Fix codeblocks
jverre Jan 12, 2025
05c2fbd
Fix codeblocks
jverre Jan 12, 2025
b985e5e
Fix codeblocks
jverre Jan 12, 2025
ff3399f
Update github actions
jverre Jan 12, 2025
fe154cd
Update github actions
jverre Jan 12, 2025
367fdba
Update github actions
jverre Jan 12, 2025
b99eb62
Fix codeblocks
jverre Jan 12, 2025
def794b
Updated following review
jverre Jan 13, 2025
8028c65
Updated following review
jverre Jan 13, 2025
131c151
Updated following review
jverre Jan 13, 2025
83c44bd
Move litellm opik monitoring logic to a separate module, add project …
alexkuzmik Jan 14, 2025
f383c09
Fix error_callback -> failure_callback
alexkuzmik Jan 14, 2025
5d217bd
Reorganize imports
alexkuzmik Jan 14, 2025
1b8d875
Make it possible to disable litellm tracking, dont track if litellm a…
alexkuzmik Jan 14, 2025
22ed1a2
Disable litellm monitoring via the callback in tests
alexkuzmik Jan 14, 2025
7a89c69
Merge branch 'main' into jacques/evaluate_prompt
alexkuzmik Jan 14, 2025
c279051
Explicitly disable litellm monitoring in every integration test workflow
alexkuzmik Jan 14, 2025
9dedf64
Fix lint errors
alexkuzmik Jan 14, 2025
21 changes: 8 additions & 13 deletions .github/workflows/documentation_codeblock_tests.yml
@@ -1,15 +1,6 @@
name: Documentation - Test codeblocks
on:
workflow_dispatch:
inputs:
install_opik:
description: 'Enable opik installation from source files'
required: false
default: 'false'
type: choice
options:
- 'false'
- 'true'
pull_request:
paths:
- 'apps/opik-documentation/documentation/docs/*.md'
@@ -63,6 +54,13 @@ jobs:
fail-fast: false
steps:
- uses: actions/checkout@v3
if: github.event_name == 'pull_request'
with:
ref: ${{ github.event.pull_request.head.sha }}
fetch-depth: 0

- uses: actions/checkout@v3
if: github.event_name != 'pull_request'

- name: Set up Python
uses: actions/setup-python@v4
@@ -75,13 +73,10 @@ jobs:
python -m pip install --upgrade pip
pip install pytest
pip install -r requirements.txt
if [ "${{ github.event.inputs.install_opik }}" = "true" ]; then
pip install -e .
fi

- name: Run tests
working-directory: apps/opik-documentation/documentation
run: |
if [ -n "${{ matrix.path }}" ]; then
pytest ${{ matrix.path }} -v --suppress-no-test-exit-code
pytest ${{ matrix.path }} -v --suppress-no-test-exit-code --default-package=../../../sdks/python
fi
1 change: 1 addition & 0 deletions .github/workflows/lib-integration-tests-runner.yml
@@ -34,6 +34,7 @@ on:
env:
SLACK_WEBHOOK_URL: ${{ secrets.ACTION_MONITORING_SLACK }}
LIBS: ${{ github.event.inputs.libs != '' && github.event.inputs.libs || 'all' }}
OPIK_DISABLE_LITELLM_MODELS_MONITORING: True

jobs:
init_environment:
2 changes: 2 additions & 0 deletions .github/workflows/python_sdk_unit_tests.yml
@@ -9,6 +9,8 @@ on:
- 'main'
paths:
- 'sdks/python/**'
env:
OPIK_DISABLE_LITELLM_MODELS_MONITORING: True
jobs:
UnitTests:
name: Units_Python_${{matrix.python_version}}
3 changes: 2 additions & 1 deletion .github/workflows/sdk-e2e-tests.yaml
@@ -12,7 +12,8 @@ on:
paths:
- 'sdks/python/**'
- 'apps/opik-backend/**'

env:
OPIK_DISABLE_LITELLM_MODELS_MONITORING: True
jobs:
run-e2e:
name: SDK E2E Tests ${{matrix.python_version}}
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -157,7 +157,7 @@ cd apps/opik-documentation/documentation
npm install

# Run the documentation website locally
npm run start
npm run dev
```

You can then access the documentation website at `http://localhost:3000`. Any change you make to the documentation will be updated in real-time.
4 changes: 2 additions & 2 deletions apps/opik-documentation/documentation/conftest.py
@@ -1,4 +1,4 @@
from pytest_codeblocks.pytest_integration import pytest_collect_file
from pytest_codeblocks.pytest_integration import pytest_collect_file, pytest_addoption

# Export the necessary components
__all__ = ["pytest_collect_file"]
__all__ = ["pytest_collect_file", "pytest_addoption"]
2 changes: 1 addition & 1 deletion apps/opik-documentation/documentation/docs/changelog.md
@@ -183,7 +183,7 @@ pytest_codeblocks_skip: true

**SDK**:

- Introduced the `Prompt` object in the SDK to manage prompts stored in the library. See the [Prompt Management](/library/managing_prompts_in_code.mdx) guide for more details.
- Introduced the `Prompt` object in the SDK to manage prompts stored in the library. See the [Prompt Management](/prompt_engineering/managing_prompts_in_code.mdx) guide for more details.
- Introduced a `Opik.search_spans` method to search for spans in a project. See the [Search spans](/tracing/export_data.md#exporting-spans) guide for more details.
- Released a new integration with [AWS Bedrock](/tracing/integrations/bedrock.md) for using Opik with Bedrock models.

@@ -6,7 +6,7 @@ description: Introduces the concepts behind Opik's evaluation framework
# Evaluation Concepts

:::tip
If you want to jump straight to running evaluations, you can head to the [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) section.
If you want to jump straight to running evaluations, you can head to the [Evaluate prompts](/docs/evaluation/evaluate_prompt.md) or [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) guides.
:::

When working with LLM applications, the bottleneck to iterating faster is often the evaluation process. While it is possible to manually review your LLM application's output, this process is slow and not scalable. Instead of manually reviewing your LLM application's output, Opik allows you to automate the evaluation of your LLM application.
@@ -63,27 +63,10 @@ Experiment items store the input, expected output, actual output and feedback scores

![Experiment Items](/img/evaluation/experiment_items.png)

## Running an evaluation
## Learn more

When you run an evaluation, you will need to know the following:
We have provided some guides to help you get started with Opik's evaluation framework:

1. Dataset: The dataset you want to run the evaluation on.
2. Evaluation task: This maps the inputs stored in the dataset to the output you would like to score. The evaluation task is typically the LLM application you are building.
3. Metrics: The metrics you would like to use when scoring the outputs of your LLM

You can then run the evaluation using the `evaluate` function:

```python
from opik import evaluate

evaluate(
    dataset=dataset,
    evaluation_task=evaluation_task,
    metrics=metrics,
    experiment_config={"prompt_template": "..."},
)
```

:::tip
You can find a full tutorial on defining evaluations in the [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) section.
:::
1. [Overview of Opik's evaluation features](/docs/evaluation/overview.mdx)
2. [Evaluate prompts](/docs/evaluation/evaluate_prompt.md)
3. [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md)
@@ -0,0 +1,124 @@
---
sidebar_label: Evaluate Prompts
description: Step by step guide on how to evaluate LLM prompts
---

# Evaluate Prompts

When developing prompts and performing prompt engineering, it can be challenging to know if a new prompt is better than the previous version.

Opik Experiments allow you to evaluate a prompt on multiple samples, score each LLM output, and compare the performance of different prompts.

![Experiment page](/img/evaluation/experiment_items.png)

There are two ways to evaluate a prompt in Opik:

1. Using the prompt playground
2. Using the `evaluate_prompt` function in the Python SDK

## Using the prompt playground

The Opik playground allows you to quickly test different prompts and see how they perform.

You can compare multiple prompts by clicking the `+ Add prompt` button in the top right corner of the playground. This allows you to enter several prompts and view them side by side.

In order to evaluate the prompts on samples, you can add variables to the prompt messages using the `{{variable}}` syntax. You can then connect a dataset and run the prompts on each dataset item.

![Playground evaluation](/img/evaluation/playground_evaluation.gif)
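As a purely illustrative sketch (the `question` variable, the dataset item and the rendering logic below are hypothetical examples, not part of the Opik SDK), this is how a templated message lines up with a dataset item:

```python
# A playground-style message template using the mustache {{variable}} syntax.
prompt_template = {
    "role": "user",
    "content": "Answer the following question: {{question}}",
}

# Each dataset item supplies a value for every variable used in the template.
dataset_item = {"question": "What is the capital of France?"}

# Conceptually, the playground renders one prompt per dataset item:
rendered = prompt_template["content"].replace("{{question}}", dataset_item["question"])
print(rendered)  # Answer the following question: What is the capital of France?
```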

## Using the Python SDK

The Python SDK provides a simple way to evaluate prompts using the `evaluate_prompt` function. This method allows you to specify a dataset, a prompt and a model. The prompt is evaluated on each dataset item, and the outputs can then be reviewed and annotated in the Opik UI.

To run the experiment, you can use the following code:

```python
import opik
from opik.evaluation import evaluate_prompt

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model="gpt-3.5-turbo",
)
```

Once the evaluation is complete, you can view the responses in the Opik UI and score each LLM output.

![Experiment page](/img/evaluation/experiment_items.png)

### Automate the scoring process

Manually reviewing each LLM output can be time-consuming and error-prone. The `evaluate_prompt` function accepts a list of scoring metrics that are used to score each LLM output. Opik has a set of built-in metrics that allow you to detect hallucinations, measure answer relevance, and more; if we don't have the metric you need, you can easily create your own.

You can find a full list of all the Opik supported metrics in the [Metrics Overview](/evaluation/metrics/overview.md) section or you can define your own metric using [Custom Metrics](/evaluation/metrics/custom_metric.md).
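For reference, a minimal custom metric could be sketched as follows. This mirrors the `BaseMetric` / `ScoreResult` pattern described in the Custom Metrics guide; treat the class and scoring logic as an illustrative assumption and check that guide for the authoritative interface:

```python
from opik.evaluation.metrics import base_metric, score_result


class ContainsParis(base_metric.BaseMetric):
    """Toy metric that checks whether the model output mentions 'Paris'."""

    def __init__(self, name: str = "contains_paris"):
        self.name = name

    def score(self, output: str, **ignored_kwargs):
        # Return a score between 0 and 1 for a single LLM output.
        return score_result.ScoreResult(
            value=1.0 if "Paris" in output else 0.0,
            name=self.name,
        )
```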

By adding the `scoring_metrics` parameter to the `evaluate_prompt` function, you can specify a list of metrics to use for scoring. We will update the example above to use the `Hallucination` metric:

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model="gpt-3.5-turbo",
    scoring_metrics=[Hallucination()],
)
```

### Customizing the model used

You can customize the model used by creating a new model with the [`LiteLLMChatModel`](https://www.comet.com/docs/opik/python-sdk-reference/Objects/LiteLLMChatModel.html) class. This supports passing additional parameters such as the `temperature` or the base URL to use for the model.

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination
from opik.evaluation.models import litellm_chat_model

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model=litellm_chat_model.LiteLLMChatModel(model="gpt-3.5-turbo", temperature=0),
    scoring_metrics=[Hallucination()],
)
```

## Next steps

To evaluate complex LLM applications like RAG applications or agents, you can use the [`evaluate`](/evaluation/evaluate_your_llm.md) function.
@@ -1,12 +1,16 @@
---
sidebar_label: Evaluate your LLM Application
sidebar_label: Evaluate Complex LLM Applications
description: Step by step guide on how to evaluate your LLM application
pytest_codeblocks_execute_previous: true
---

# Evaluate your LLM Application
# Evaluate Complex LLM Applications

Evaluating your LLM application allows you to have confidence in the performance of your LLM application. This evaluation set is often performed both during the development and as part of the testing of an application.
Evaluating your LLM application gives you confidence in its performance. In this guide, we will walk through the process of evaluating complex applications like LLM chains or agents.

:::tip
In this guide, we will focus on evaluating complex LLM applications; if you are looking to evaluate single prompts, you can refer to the [Evaluate a prompt](/evaluation/evaluate_prompt.md) guide.
:::

The evaluation is done in five steps:

@@ -178,7 +182,7 @@ evaluation = evaluate(

### Linking prompts to experiments

The [Opik prompt library](/library/prompt_management.mdx) can be used to version your prompt templates.
The [Opik prompt library](/prompt_engineering/prompt_management.mdx) can be used to version your prompt templates.

When creating an Experiment, you can link the Experiment to a specific prompt version:
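The full example is collapsed in this diff view. As a rough sketch only (the `create_prompt` method, the `prompt` argument of `evaluate`, and the names used below are assumptions based on the SDK reference, so verify them against the complete guide):

```python
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination

opik_client = opik.Opik()

# Register a versioned prompt in the Opik prompt library (illustrative name and template).
prompt = opik_client.create_prompt(
    name="translation-prompt",
    prompt="Translate the following text to French: {{input}}",
)

dataset = opik_client.get_or_create_dataset("my_dataset")

def evaluation_task(dataset_item):
    # Replace this with a call to your LLM application.
    return {"output": dataset_item["input"]}

# Passing the prompt object links the resulting experiment to this prompt version.
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],
    prompt=prompt,
)
```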

@@ -238,7 +242,7 @@ In order to evaluate datasets more efficiently, Opik uses multiple background threads

You can access all the experiments logged to the platform from the SDK with the [`Opik.get_experiments_by_name`](https://www.comet.com/docs/opik/python-sdk-reference/Opik.html#opik.Opik.get_experiment_by_name) and [`Opik.get_experiment_by_id`](https://www.comet.com/docs/opik/python-sdk-reference/Opik.html#opik.Opik.get_experiment_by_id) methods:

```python
```python pytest_codeblocks_skip=true
import opik

# Get the experiment
@@ -14,27 +14,27 @@ Heuristic metrics are deterministic and are often statistical in nature. LLM as

Opik provides the following built-in evaluation metrics:

| Metric | Type | Description | Documentation |
| ---------------- | -------------- | ------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| Equals | Heuristic | Checks if the output exactly matches an expected string | [Equals](/evaluation/metrics/heuristic_metrics#equals) |
| Contains | Heuristic | Check if the output contains a specific substring, can be both case sensitive or case insensitive | [Contains](/evaluation/metrics/heuristic_metrics#contains) |
| RegexMatch | Heuristic | Checks if the output matches a specified regular expression pattern | [RegexMatch](/evaluation/metrics/heuristic_metrics#regexmatch) |
| IsJson | Heuristic | Checks if the output is a valid JSON object | [IsJson](/evaluation/metrics/heuristic_metrics#isjson) |
| Levenshtein | Heuristic | Calculates the Levenshtein distance between the output and an expected string | [Levenshtein](/evaluation/metrics/heuristic_metrics#levenshteinratio) |
| Hallucination | LLM as a Judge | Check if the output contains any hallucinations | [Hallucination](/evaluation/metrics/hallucination) |
| G-Eval | LLM as a Judge | Task agnostic LLM as a Judge metric | [G-Eval](/evaluation/metrics/g_eval) |
| Moderation | LLM as a Judge | Check if the output contains any harmful content | [Moderation](/evaluation/metrics/moderation) |
| AnswerRelevance | LLM as a Judge | Check if the output is relevant to the question | [AnswerRelevance](/evaluation/metrics/answer_relevance) |
| ContextRecall | LLM as a Judge | Check if the output contains any hallucinations | [ContextRecall](/evaluation/metrics/context_recall) |
| ContextPrecision | LLM as a Judge | Check if the output contains any hallucinations | [ContextPrecision](/evaluation/metrics/context_precision) |

You can also create your own custom metric, learn more about it in the [Custom Metric](/evaluation/metrics/custom_metric) section.
| Metric | Type | Description | Documentation |
| ---------------- | -------------- | ------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| Equals | Heuristic | Checks if the output exactly matches an expected string | [Equals](/evaluation/metrics/heuristic_metrics.md#equals) |
| Contains | Heuristic | Check if the output contains a specific substring, can be both case sensitive or case insensitive | [Contains](/evaluation/metrics/heuristic_metrics.md#contains) |
| RegexMatch | Heuristic | Checks if the output matches a specified regular expression pattern | [RegexMatch](/evaluation/metrics/heuristic_metrics.md#regexmatch) |
| IsJson | Heuristic | Checks if the output is a valid JSON object | [IsJson](/evaluation/metrics/heuristic_metrics.md#isjson) |
| Levenshtein | Heuristic | Calculates the Levenshtein distance between the output and an expected string | [Levenshtein](/evaluation/metrics/heuristic_metrics.md#levenshteinratio) |
| Hallucination | LLM as a Judge | Check if the output contains any hallucinations | [Hallucination](/evaluation/metrics/hallucination.md) |
| G-Eval | LLM as a Judge | Task agnostic LLM as a Judge metric | [G-Eval](/evaluation/metrics/g_eval.md) |
| Moderation | LLM as a Judge | Check if the output contains any harmful content | [Moderation](/evaluation/metrics/moderation.md) |
| AnswerRelevance | LLM as a Judge | Check if the output is relevant to the question | [AnswerRelevance](/evaluation/metrics/answer_relevance.md) |
| ContextRecall | LLM as a Judge | Check if the output contains any hallucinations | [ContextRecall](/evaluation/metrics/context_recall.md) |
| ContextPrecision | LLM as a Judge | Check if the output contains any hallucinations | [ContextPrecision](/evaluation/metrics/context_precision.md) |

You can also create your own custom metric, learn more about it in the [Custom Metric](/evaluation/metrics/custom_metric.md) section.

## Customizing LLM as a Judge metrics

By default, Opik uses GPT-4o from OpenAI as the LLM to evaluate the output of other LLMs. However, you can easily switch to another LLM provider by specifying a different model in the `model` parameter of each LLM as a Judge metric.

```python
```python pytest_codeblocks_skip=true
from opik.evaluation.metrics import Hallucination

metric = Hallucination(model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")