Jacques/evaluate prompt #1023

Merged
merged 28 commits into main from jacques/evaluate_prompt on Jan 14, 2025
Changes from 1 commit
Commits
28 commits
60986c5
WIP
jverre Jan 8, 2025
911c01b
WIP
jverre Jan 10, 2025
e7c3cc2
WIP
jverre Jan 10, 2025
13626dc
Update evaluation
jverre Jan 12, 2025
8cafa1c
Update for linters
jverre Jan 12, 2025
79b5f90
Update testing of code blocks
jverre Jan 12, 2025
49f27a3
Update testing of code blocks
jverre Jan 12, 2025
a3d7279
Update testing of code blocks
jverre Jan 12, 2025
e9ded5b
Update github actions
jverre Jan 12, 2025
1e251f7
Fix codeblocks
jverre Jan 12, 2025
b507c0d
Fix codeblocks
jverre Jan 12, 2025
05c2fbd
Fix codeblocks
jverre Jan 12, 2025
b985e5e
Fix codeblocks
jverre Jan 12, 2025
ff3399f
Update github actions
jverre Jan 12, 2025
fe154cd
Update github actions
jverre Jan 12, 2025
367fdba
Update github actions
jverre Jan 12, 2025
b99eb62
Fix codeblocks
jverre Jan 12, 2025
def794b
Updated following review
jverre Jan 13, 2025
8028c65
Updated following review
jverre Jan 13, 2025
131c151
Updated following review
jverre Jan 13, 2025
83c44bd
Move litellm opik monitoring logic to a separate module, add project …
alexkuzmik Jan 14, 2025
f383c09
Fix error_callback -> failure_callback
alexkuzmik Jan 14, 2025
5d217bd
Reorganize imports
alexkuzmik Jan 14, 2025
1b8d875
Make it possible to disable litellm tracking, dont track if litellm a…
alexkuzmik Jan 14, 2025
22ed1a2
Disable litellm monitoring via the callback in tests
alexkuzmik Jan 14, 2025
7a89c69
Merge branch 'main' into jacques/evaluate_prompt
alexkuzmik Jan 14, 2025
c279051
Explicitly disable litellm monitoring in every integration test workflow
alexkuzmik Jan 14, 2025
9dedf64
Fix lint errors
alexkuzmik Jan 14, 2025
WIP
jverre committed Jan 10, 2025
commit 911c01bc836bd44ac42795645dd690d82f212201
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -157,7 +157,7 @@ cd apps/opik-documentation/documentation
npm install

# Run the documentation website locally
npm run start
npm run dev
```

You can then access the documentation website at `http://localhost:3000`. Any change you make to the documentation will be updated in real-time.
@@ -6,7 +6,7 @@ description: Introduces the concepts behind Opik's evaluation framework
# Evaluation Concepts

:::tip
If you want to jump straight to running evaluations, you can head to the [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) section.
If you want to jump straight to running evaluations, you can head to the [Evaluate prompts](/docs/evaluation/evaluate_prompt.md) or [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) guides.
:::

When working with LLM applications, the bottleneck to iterating faster is often the evaluation process. While it is possible to manually review your LLM application's output, this process is slow and not scalable. Instead, Opik allows you to automate the evaluation of your LLM application.
@@ -63,27 +63,10 @@ Experiment items store the input, expected output, actual output and feedback sc

![Experiment Items](/img/evaluation/experiment_items.png)

## Running an evaluation
## Learn more

When you run an evaluation, you will need to know the following:
We have provided some guides to help you get started with Opik's evaluation framework:

1. Dataset: The dataset you want to run the evaluation on.
2. Evaluation task: This maps the inputs stored in the dataset to the output you would like to score. The evaluation task is typically the LLM application you are building.
3. Metrics: The metrics you would like to use when scoring the outputs of your LLM

You can then run the evaluation using the `evaluate` function:

```python
from opik import evaluate

evaluate(
    dataset=dataset,
    evaluation_task=evaluation_task,
    metrics=metrics,
    experiment_config={"prompt_template": "..."},
)
```

:::tip
You can find a full tutorial on defining evaluations in the [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) section.
:::
1. [Overview of Opik's evaluation features](/docs/evaluation/overview.md)
2. [Evaluate prompts](/docs/evaluation/evaluate_prompt.md)
3. [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md)
@@ -6,9 +6,115 @@ pytest_codeblocks_execute_previous: true

# Evaluate a prompt

You can evaluate a prompt by running the `evaluate_prompt` function. This function takes:
When developing prompts and performing prompt engineering, it can be challenging to know if a new prompt is better than the previous version.

1. A dataset: A list of samples to evaluate the prompt on
2. A prompt: List of messages that will be evaluated
3. A model: The model to use for evaluation
4. Scoring metrics: A list of metrics to evaluate the output on
Opik Experiments allow you to evaluate the prompt on multiple samples, score each LLM output and compare the performance of different prompts.

<!-- Image of prompt experiments -->

There are two ways to evaluate a prompt in Opik:

1. Using the prompt playground
2. Using the `evaluate_prompt` function in the Python SDK

## Using the prompt playground

The Opik playground allows you to quickly test different prompts and see how they perform.

You can compare multiple prompts to each other by clicking the `+ Add prompt` button in the top right corner of the playground. This will allow you to enter multiple prompts and compare them side by side.

In order to evaluate the prompts on samples, you can add variables to the prompt messages using the `{{variable}}` syntax. You can then connect a dataset and run the prompts on each dataset item.
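
As an illustration of how the substitution works, each `{{variable}}` placeholder is replaced by the value of the matching dataset column for every dataset item. The `render_prompt` helper below is a minimal sketch of this behaviour and is not part of the Opik SDK:

```python
import re

def render_prompt(template: str, dataset_item: dict) -> str:
    # Replace each {{variable}} placeholder with the value of the matching dataset column
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(dataset_item[m.group(1)]), template)

print(render_prompt(
    "Translate the following text to French: {{input}}",
    {"input": "Hello, world!"},
))
# Translate the following text to French: Hello, world!
```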

<!-- Image of playground -->

## Using the Python SDK

The Python SDK provides a simple way to evaluate prompts using the `evaluate_prompt` function. This method allows you to specify a dataset, a prompt and a model. The prompt is then evaluated on each dataset item and the output can then be reviewed and annotated in the Opik UI.

To run the experiment, you can use the following code:

```python
import opik
from opik.evaluation import evaluate_prompt

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
    {"input": "Hello, world!", "expected_output": "Hello, world!"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model="gpt-3.5-turbo",
)
```

Once the evaluation is complete, you can view the responses in the Opik UI and score each LLM output.

<!-- Screenshot of experiment UI -->

### Automate the scoring process

Manually reviewing each LLM output can be time-consuming and error-prone. The `evaluate_prompt` function accepts a list of scoring metrics that are used to score each LLM output. Opik has a set of built-in metrics that can detect hallucinations, measure answer relevance, and more; if the metric you need is not available, you can easily create your own.

You can find a full list of all the Opik supported metrics in the [Metrics Overview](/evaluation/metrics/overview.md) section or you can define your own metric using [Custom Metrics](/evaluation/metrics/custom_metric.md).
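
As a rough sketch, a custom metric is simply a class that returns a score for each LLM output. The `ExactMatch` metric below is an illustrative example (not a built-in Opik metric) of what such a class can look like:

```python
from opik.evaluation.metrics import base_metric, score_result

class ExactMatch(base_metric.BaseMetric):
    """Illustrative custom metric that checks whether the output matches the expected output exactly."""

    def __init__(self, name: str = "exact_match"):
        self.name = name

    def score(self, output: str, expected_output: str, **ignored_kwargs):
        # Compare the model output with the expected output from the dataset item
        matches = output.strip() == expected_output.strip()
        return score_result.ScoreResult(
            value=1.0 if matches else 0.0,
            name=self.name,
            reason="Exact match" if matches else "Output differs from the expected output",
        )
```

A metric defined this way can then be passed to `scoring_metrics` just like the built-in metrics.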

By adding the `scoring_metrics` parameter to the `evaluate_prompt` function, you can specify the list of metrics to use for scoring. The example below updates the one above to score each output with the `Hallucination` metric:

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
    {"input": "Hello, world!", "expected_output": "Hello, world!"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model="gpt-3.5-turbo",
    scoring_metrics=[Hallucination()],
)
```

### Customizing the model used

You can customize the model used by creating a new model with the [`LiteLLMChatModel`](https://www.comet.com/docs/opik/python-sdk-reference/Objects/LiteLLMChatModel.html) class. This allows you to pass additional parameters to the model, such as the `temperature` or the base URL to use.

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
    {"input": "Hello, world!", "expected_output": "Hello, world!"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model=opik.LiteLLMChatModel(model="gpt-3.5-turbo", temperature=0),
    scoring_metrics=[Hallucination()],
)
```
@@ -1,12 +1,16 @@
---
sidebar_label: Evaluate your LLM Application
sidebar_label: Evaluate Complex LLM Applications
description: Step by step guide on how to evaluate your LLM application
pytest_codeblocks_execute_previous: true
---

# Evaluate your LLM Application
# Evaluate Complex LLM Applications

Evaluating your LLM application allows you to have confidence in the performance of your LLM application. This evaluation set is often performed both during the development and as part of the testing of an application.
Evaluating your LLM application allows you to have confidence in the performance of your LLM application. In this guide, we will walk through the process of evaluating complex applications like LLM chains or agents.

:::tip
In this guide, we will focus on evaluating complex LLM applications. If you are looking to evaluate single prompts, you can refer to the [Evaluate a prompt](/evaluation/evaluate_prompt.md) guide.
:::

The evaluation is done in five steps:

134 changes: 134 additions & 0 deletions apps/opik-documentation/documentation/docs/evaluation/overview.mdx
@@ -0,0 +1,134 @@
---
sidebar_label: Overview
description: A high-level overview on how to use Opik's evaluation features including some code snippets
---

import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";

# Overview

Evaluation in Opik helps you assess and measure the quality of your LLM outputs across different dimensions.
It provides a framework to systematically test your prompts and models against datasets, using various metrics
to measure performance.

Opik also provides a set of pre-built metrics for common evaluation tasks. These metrics are designed to help you
quickly and effectively gauge the performance of your LLM outputs and include metrics such as Hallucination,
Answer Relevance, Context Precision/Recall and more. You can learn more about the available metrics in the
[Metrics Overview](/evaluation/metrics/overview.md) section.

## Running an Evaluation

Each evaluation is defined by a dataset, an evaluation task and a set of evaluation metrics:

1. **Dataset**: A dataset is a collection of samples that represent the inputs and, optionally, expected outputs for
your LLM application.
2. **Evaluation task**: This maps the inputs stored in the dataset to the output you would like to score. The evaluation
task is typically the LLM application you are building.
3. **Metrics**: The metrics you would like to use when scoring the outputs of your LLM application.

To simplify the evaluation process, Opik provides two main evaluation methods: `evaluate_prompt` for evaluating prompt
templates and a more general `evaluate` method for more complex evaluation scenarios.

<Tabs>
<TabItem value="Evaluating Prompts" title="Evaluating Prompts">

To evaluate a specific prompt against a dataset:

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("Evaluation test dataset")
dataset.insert([
    {"input": "Hello, world!", "expected_output": "Hello, world!"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
result = evaluate_prompt(
    dataset=dataset,
    messages=[{"role": "user", "content": "Translate the following text to French: {{input}}"}],
    model="gpt-3.5-turbo",  # or your preferred model
    scoring_metrics=[Hallucination()],
)
```

</TabItem>
<TabItem value="Evaluating RAG applications and Agents" title="Evaluating RAG applications and Agents">

For more complex evaluation scenarios where you need custom processing:

```python
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import ContextPrecision, ContextRecall

# Create a dataset with questions and their contexts
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("RAG evaluation dataset")
dataset.insert([
    {
        "question": "What are the key features of Python?",
        "context": "Python is known for its simplicity and readability. Key features include dynamic typing, automatic memory management, and an extensive standard library.",
        "expected_answer": "Python's key features include dynamic typing, automatic memory management, and an extensive standard library."
    },
    {
        "question": "How does garbage collection work in Python?",
        "context": "Python uses reference counting and a cyclic garbage collector. When an object's reference count drops to zero, it is deallocated.",
        "expected_answer": "Python uses reference counting for garbage collection. Objects are deallocated when their reference count reaches zero."
    }
])

def rag_task(item):
    # Simulate a RAG pipeline; replace retrieve_relevant_context and generate_response
    # with your own retrieval and generation logic
    context = retrieve_relevant_context(item["question"])
    response = generate_response(item["question"], context)
    return {
        "question": item["question"],
        "generated_response": response,
        "retrieved_context": context,
        "expected_answer": item["expected_answer"],
        "ground_truth_context": item["context"]
    }

# Run the evaluation
result = evaluate(
    dataset=dataset,
    task=rag_task,
    scoring_metrics=[
        ContextPrecision(),
        ContextRecall()
    ],
    experiment_name="rag_evaluation"
)
```

</TabItem>
</Tabs>

## Analyzing Evaluation Results

Once the evaluation is complete, Opik allows you to manually review the results and compare them with previous iterations.

![Experiment page](/img/evaluation/experiment_items.png)

On the experiment page, you will be able to:

1. Review the output provided by the LLM for each sample in the dataset
2. Deep dive into each sample by clicking on the `item ID`
3. Review the experiment configuration to see how the experiment was run
4. Compare multiple experiments side by side

## Learn more

You can learn more about Opik's evaluation features in:

1. [Evaluation concepts](/evaluation/concepts.md)
1. [Evaluate prompts](/evaluation/evaluate_prompt.md)
1. [Evaluate complex LLM applications](/evaluation/evaluate_your_llm.md)
1. [Evaluation metrics](/evaluation/metrics/overview.md)
1. [Manage datasets](/evaluation/manage_datasets.md)
Empty file.
15 changes: 10 additions & 5 deletions apps/opik-documentation/documentation/sidebars.ts
@@ -33,7 +33,7 @@ const sidebars: SidebarsConfig = {
},
{
type: "category",
label: "Tracing",
label: "Observability",
collapsed: false,
items: [
"tracing/log_traces",
@@ -75,11 +75,12 @@ const sidebars: SidebarsConfig = {
label: "Evaluation",
collapsed: false,
items: [
"evaluation/overview",
"evaluation/concepts",
"evaluation/manage_datasets",
"evaluation/evaluate_prompt",
"evaluation/evaluate_your_llm",
"evaluation/update_existing_experiment",
"evaluation/playground",
"evaluation/manage_datasets",
{
type: "category",
label: "Metrics",
@@ -101,9 +102,13 @@ const sidebars: SidebarsConfig = {
},
{
type: "category",
label: "Prompt Management",
label: "Prompt engineering",
collapsed: true,
items: ["library/prompt_management", "library/managing_prompts_in_code"],
items: [
"prompt_engineering/prompt_management",
"prompt_engineering/managing_prompts_in_code",
"prompt_engineering/playground",
],
},
{
type: "category",
@@ -0,0 +1,4 @@
evaluate_prompt
===============

.. autofunction:: opik.evaluation.evaluate_prompt
1 change: 1 addition & 0 deletions apps/opik-documentation/python-sdk-docs/source/index.rst
@@ -178,6 +178,7 @@ You can learn more about the `opik` python SDK in the following sections:

evaluation/Dataset
evaluation/evaluate
evaluation/evaluate_prompt
evaluation/evaluate_experiment
evaluation/metrics/index

19 changes: 19 additions & 0 deletions sdks/python/examples/evaluate_prompt.py
@@ -0,0 +1,19 @@
import opik
from opik.evaluation import evaluate_prompt

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
    {"question": "Hello, world!", "expected_output": "Hello, world!"},
    {"question": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    llm_messages=[
        {"role": "user", "content": "Translate the following text to French: {{question}}"},
    ],
    model="gpt-3.5-turbo",
)