Jacques/evaluate prompt #1023

Merged
merged 28 commits into main from jacques/evaluate_prompt on Jan 14, 2025
Changes from 1 commit
Commits
28 commits
60986c5
WIP
jverre Jan 8, 2025
911c01b
WIP
jverre Jan 10, 2025
e7c3cc2
WIP
jverre Jan 10, 2025
13626dc
Update evaluation
jverre Jan 12, 2025
8cafa1c
Update for linters
jverre Jan 12, 2025
79b5f90
Update testing of code blocks
jverre Jan 12, 2025
49f27a3
Update testing of code blocks
jverre Jan 12, 2025
a3d7279
Update testing of code blocks
jverre Jan 12, 2025
e9ded5b
Update github actions
jverre Jan 12, 2025
1e251f7
Fix codeblocks
jverre Jan 12, 2025
b507c0d
Fix codeblocks
jverre Jan 12, 2025
05c2fbd
Fix codeblocks
jverre Jan 12, 2025
b985e5e
Fix codeblocks
jverre Jan 12, 2025
ff3399f
Update github actions
jverre Jan 12, 2025
fe154cd
Update github actions
jverre Jan 12, 2025
367fdba
Update github actions
jverre Jan 12, 2025
b99eb62
Fix codeblocks
jverre Jan 12, 2025
def794b
Updated following review
jverre Jan 13, 2025
8028c65
Updated following review
jverre Jan 13, 2025
131c151
Updated following review
jverre Jan 13, 2025
83c44bd
Move litellm opik monitoring logic to a separate module, add project …
alexkuzmik Jan 14, 2025
f383c09
Fix error_callback -> failure_callback
alexkuzmik Jan 14, 2025
5d217bd
Reorganize imports
alexkuzmik Jan 14, 2025
1b8d875
Make it possible to disable litellm tracking, dont track if litellm a…
alexkuzmik Jan 14, 2025
22ed1a2
Disable litellm monitoring via the callback in tests
alexkuzmik Jan 14, 2025
7a89c69
Merge branch 'main' into jacques/evaluate_prompt
alexkuzmik Jan 14, 2025
c279051
Explicitly disable litellm monitoring in every integration test workflow
alexkuzmik Jan 14, 2025
9dedf64
Fix lint errors
alexkuzmik Jan 14, 2025
WIP
jverre committed Jan 10, 2025
commit 911c01bc836bd44ac42795645dd690d82f212201
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -157,7 +157,7 @@ cd apps/opik-documentation/documentation
npm install

# Run the documentation website locally
npm run start
npm run dev
```

You can then access the documentation website at `http://localhost:3000`. Any change you make to the documentation will be updated in real-time.
@@ -6,7 +6,7 @@ description: Introduces the concepts behind Opik's evaluation framework
# Evaluation Concepts

:::tip
If you want to jump straight to running evaluations, you can head to the [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) section.
If you want to jump straight to running evaluations, you can head to the [Evaluate prompts](/docs/evaluation/evaluate_prompt.md) or [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) guides.
:::

When working with LLM applications, the bottleneck to iterating faster is often the evaluation process. While it is possible to manually review your LLM application's output, this process is slow and not scalable. Instead, Opik allows you to automate the evaluation of your LLM application.
@@ -63,27 +63,10 @@ Experiment items store the input, expected output, actual output and feedback sc

![Experiment Items](/img/evaluation/experiment_items.png)

## Running an evaluation
## Learn more

When you run an evaluation, you will need to know the following:
We have provided some guides to help you get started with Opik's evaluation framework:

1. Dataset: The dataset you want to run the evaluation on.
2. Evaluation task: This maps the inputs stored in the dataset to the output you would like to score. The evaluation task is typically the LLM application you are building.
3. Metrics: The metrics you would like to use when scoring the outputs of your LLM

You can then run the evaluation using the `evaluate` function:

```python
from opik import evaluate

evaluate(
    dataset=dataset,
    evaluation_task=evaluation_task,
    metrics=metrics,
    experiment_config={"prompt_template": "..."},
)
```

:::tip
You can find a full tutorial on defining evaluations in the [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) section.
:::
1. [Overview of Opik's evaluation features](/docs/evaluation/overview.md)
2. [Evaluate prompts](/docs/evaluation/evaluate_prompt.md)
3. [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md)
@@ -6,9 +6,115 @@ pytest_codeblocks_execute_previous: true

# Evaluate a prompt

You can evaluate a prompt by running the `evaluate_prompt` function. This function takes:
When developing prompts and performing prompt engineering, it can be challenging to know if a new prompt is better than the previous version.

1. A dataset: A list of samples to evaluate the prompt on
2. A prompt: List of messages that will be evaluated
3. A model: The model to use for evaluation
4. Scoring metrics: A list of metrics to evaluate the output on
Opik Experiments allow you to evaluate the prompt on multiple samples, score each LLM output and compare the performance of different prompts.

<!-- Image of prompt experiments -->

There are two ways to evaluate a prompt in Opik:

1. Using the prompt playground
2. Using the `evaluate_prompt` function in the Python SDK

## Using the prompt playground

The Opik playground allows you to quickly test different prompts and see how they perform.

You can compare multiple prompts to each other by clicking the `+ Add prompt` button in the top right corner of the playground. This will allow you to enter multiple prompts and compare them side by side.

In order to evaluate the prompts on samples, you can add variables to the prompt messages using the `{{variable}}` syntax. You can then connect a dataset and run the prompts on each dataset item.
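
As an illustration of how the substitution works, each `{{variable}}` placeholder is replaced by the value of the matching dataset column for every dataset item. The `render_prompt` helper below is a minimal sketch of this behaviour and is not part of the Opik SDK:

```python
import re

def render_prompt(template: str, dataset_item: dict) -> str:
    # Replace each {{variable}} placeholder with the value of the matching dataset column
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(dataset_item[m.group(1)]), template)

print(render_prompt(
    "Translate the following text to French: {{input}}",
    {"input": "Hello, world!"},
))
# Translate the following text to French: Hello, world!
```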

<!-- Image of playground -->

## Using the Python SDK

The Python SDK provides a simple way to evaluate prompts using the `evaluate_prompt` function. This method allows you to specify a dataset, a prompt and a model. The prompt is then evaluated on each dataset item and the output can then be reviewed and annotated in the Opik UI.

To run the experiment, you can use the following code:

```python
import opik
from opik.evaluation import evaluate_prompt

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
    {"input": "Hello, world!", "expected_output": "Hello, world!"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model="gpt-3.5-turbo",
)
```

Once the evaluation is complete, you can view the responses in the Opik UI and score each LLM output.

<!-- Screenshot of experiment UI -->

### Automate the scoring process

Manually reviewing each LLM output can be time-consuming and error-prone. The `evaluate_prompt` function accepts a list of scoring metrics that are used to score each LLM output. Opik has a set of built-in metrics that can detect hallucinations, measure answer relevance, and more; if the metric you need is not available, you can easily create your own.

You can find a full list of all the Opik supported metrics in the [Metrics Overview](/evaluation/metrics/overview.md) section or you can define your own metric using [Custom Metrics](/evaluation/metrics/custom_metric.md).
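
As a rough sketch, a custom metric is simply a class that returns a score for each LLM output. The `ExactMatch` metric below is an illustrative example (not a built-in Opik metric) of what such a class can look like:

```python
from opik.evaluation.metrics import base_metric, score_result

class ExactMatch(base_metric.BaseMetric):
    """Illustrative custom metric that checks whether the output matches the expected output exactly."""

    def __init__(self, name: str = "exact_match"):
        self.name = name

    def score(self, output: str, expected_output: str, **ignored_kwargs):
        # Compare the model output with the expected output from the dataset item
        matches = output.strip() == expected_output.strip()
        return score_result.ScoreResult(
            value=1.0 if matches else 0.0,
            name=self.name,
            reason="Exact match" if matches else "Output differs from the expected output",
        )
```

A metric defined this way can then be passed to `scoring_metrics` just like the built-in metrics.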

By adding the `scoring_metrics` parameter to the `evaluate_prompt` function, you can specify the list of metrics to use for scoring. The example below updates the one above to score each output with the `Hallucination` metric:

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
    {"input": "Hello, world!", "expected_output": "Hello, world!"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model="gpt-3.5-turbo",
    scoring_metrics=[Hallucination()],
)
```

### Customizing the model used

You can customize the model used by creating a new model with the [`LiteLLMChatModel`](https://www.comet.com/docs/opik/python-sdk-reference/Objects/LiteLLMChatModel.html) class. This allows you to pass additional parameters to the model, such as the `temperature` or the base URL to use.

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
    {"input": "Hello, world!", "expected_output": "Hello, world!"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model=opik.LiteLLMChatModel(model="gpt-3.5-turbo", temperature=0),
    scoring_metrics=[Hallucination()],
)
```
@@ -1,12 +1,16 @@
---
sidebar_label: Evaluate your LLM Application
sidebar_label: Evaluate Complex LLM Applications
description: Step by step guide on how to evaluate your LLM application
pytest_codeblocks_execute_previous: true
---

# Evaluate your LLM Application
# Evaluate Complex LLM Applications

Evaluating your LLM application allows you to have confidence in the performance of your LLM application. This evaluation set is often performed both during the development and as part of the testing of an application.
Evaluating your LLM application allows you to have confidence in the performance of your LLM application. In this guide, we will walk through the process of evaluating complex applications like LLM chains or agents.

:::tip
In this guide, we will focus on evaluating complex LLM applications. If you are looking to evaluate single prompts, you can refer to the [Evaluate a prompt](/evaluation/evaluate_prompt.md) guide.
:::

The evaluation is done in five steps:

134 changes: 134 additions & 0 deletions apps/opik-documentation/documentation/docs/evaluation/overview.mdx
@@ -0,0 +1,134 @@
---
sidebar_label: Overview
description: A high-level overview on how to use Opik's evaluation features including some code snippets
---

import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";

# Overview

Evaluation in Opik helps you assess and measure the quality of your LLM outputs across different dimensions.
It provides a framework to systematically test your prompts and models against datasets, using various metrics
to measure performance.

Opik also provides a set of pre-built metrics for common evaluation tasks. These metrics are designed to help you
quickly and effectively gauge the performance of your LLM outputs and include metrics such as Hallucination,
Answer Relevance, Context Precision/Recall and more. You can learn more about the available metrics in the
[Metrics Overview](/evaluation/metrics/overview.md) section.

## Running an Evaluation

Each evaluation is defined by a dataset, an evaluation task and a set of evaluation metrics:

1. **Dataset**: A dataset is a collection of samples that represent the inputs and, optionally, expected outputs for
your LLM application.
2. **Evaluation task**: This maps the inputs stored in the dataset to the output you would like to score. The evaluation
task is typically the LLM application you are building.
3. **Metrics**: The metrics you would like to use when scoring the outputs of your LLM application.

To simplify the evaluation process, Opik provides two main evaluation methods: `evaluate_prompt` for evaluating prompt
templates and a more general `evaluate` method for more complex evaluation scenarios.

<Tabs>
<TabItem value="Evaluating Prompts" title="Evaluating Prompts">

To evaluate a specific prompt against a dataset:

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("Evaluation test dataset")
dataset.insert([
    {"input": "Hello, world!", "expected_output": "Hello, world!"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
result = evaluate_prompt(
    dataset=dataset,
    messages=[{"role": "user", "content": "Translate the following text to French: {{input}}"}],
    model="gpt-3.5-turbo",  # or your preferred model
    scoring_metrics=[Hallucination()],
)
```

</TabItem>
<TabItem value="Evaluating RAG applications and Agents" title="Evaluating RAG applications and Agents">

For more complex evaluation scenarios where you need custom processing:

```python
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import ContextPrecision, ContextRecall

# Create a dataset with questions and their contexts
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("RAG evaluation dataset")
dataset.insert([
    {
        "question": "What are the key features of Python?",
        "context": "Python is known for its simplicity and readability. Key features include dynamic typing, automatic memory management, and an extensive standard library.",
        "expected_answer": "Python's key features include dynamic typing, automatic memory management, and an extensive standard library."
    },
    {
        "question": "How does garbage collection work in Python?",
        "context": "Python uses reference counting and a cyclic garbage collector. When an object's reference count drops to zero, it is deallocated.",
        "expected_answer": "Python uses reference counting for garbage collection. Objects are deallocated when their reference count reaches zero."
    }
])

def rag_task(item):
    # Simulate a RAG pipeline; replace retrieve_relevant_context and generate_response
    # with your own retrieval and generation logic
    context = retrieve_relevant_context(item["question"])
    response = generate_response(item["question"], context)
    return {
        "question": item["question"],
        "generated_response": response,
        "retrieved_context": context,
        "expected_answer": item["expected_answer"],
        "ground_truth_context": item["context"]
    }

# Run the evaluation
result = evaluate(
    dataset=dataset,
    task=rag_task,
    scoring_metrics=[
        ContextPrecision(),
        ContextRecall()
    ],
    experiment_name="rag_evaluation"
)
```

</TabItem>
</Tabs>

## Analyzing Evaluation Results

Once the evaluation is complete, Opik allows you to manually review the results and compare them with previous iterations.

![Experiment page](/img/evaluation/experiment_items.png)

On the experiment page, you will be able to:

1. Review the output provided by the LLM for each sample in the dataset
2. Deep dive into each sample by clicking on the `item ID`
3. Review the experiment configuration to see how the experiment was run
4. Compare multiple experiments side by side

## Learn more

You can learn more about Opik's evaluation features in:

1. [Evaluation concepts](/evaluation/concepts.md)
1. [Evaluate prompts](/evaluation/evaluate_prompt.md)
1. [Evaluate complex LLM applications](/evaluation/evaluate_your_llm.md)
1. [Evaluation metrics](/evaluation/metrics/overview.md)
1. [Manage datasets](/evaluation/manage_datasets.md)
Empty file.
15 changes: 10 additions & 5 deletions apps/opik-documentation/documentation/sidebars.ts
@@ -33,7 +33,7 @@ const sidebars: SidebarsConfig = {
},
{
type: "category",
label: "Tracing",
label: "Observability",
collapsed: false,
items: [
"tracing/log_traces",
@@ -75,11 +75,12 @@ const sidebars: SidebarsConfig = {
label: "Evaluation",
collapsed: false,
items: [
"evaluation/overview",
"evaluation/concepts",
"evaluation/manage_datasets",
"evaluation/evaluate_prompt",
"evaluation/evaluate_your_llm",
"evaluation/update_existing_experiment",
"evaluation/playground",
"evaluation/manage_datasets",
{
type: "category",
label: "Metrics",
@@ -101,9 +102,13 @@ const sidebars: SidebarsConfig = {
},
{
type: "category",
label: "Prompt Management",
label: "Prompt engineering",
collapsed: true,
items: ["library/prompt_management", "library/managing_prompts_in_code"],
items: [
"prompt_engineering/prompt_management",
"prompt_engineering/managing_prompts_in_code",
"prompt_engineering/playground",
],
},
{
type: "category",
@@ -0,0 +1,4 @@
evaluate_prompt
===============

.. autofunction:: opik.evaluation.evaluate_prompt
1 change: 1 addition & 0 deletions apps/opik-documentation/python-sdk-docs/source/index.rst
@@ -178,6 +178,7 @@ You can learn more about the `opik` python SDK in the following sections:

evaluation/Dataset
evaluation/evaluate
evaluation/evaluate_prompt
evaluation/evaluate_experiment
evaluation/metrics/index

19 changes: 19 additions & 0 deletions sdks/python/examples/evaluate_prompt.py
@@ -0,0 +1,19 @@
import opik
from opik.evaluation import evaluate_prompt

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
    {"question": "Hello, world!", "expected_output": "Hello, world!"},
    {"question": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    llm_messages=[
        {"role": "user", "content": "Translate the following text to French: {{question}}"},
    ],
    model="gpt-3.5-turbo",
)