Jacques/evaluate prompt #1023

Merged
merged 28 commits into from
Jan 14, 2025

Changes from 26 commits
28 commits
60986c5
WIP
jverre Jan 8, 2025
911c01b
WIP
jverre Jan 10, 2025
e7c3cc2
WIP
jverre Jan 10, 2025
13626dc
Update evaluation
jverre Jan 12, 2025
8cafa1c
Update for linters
jverre Jan 12, 2025
79b5f90
Update testing of code blocks
jverre Jan 12, 2025
49f27a3
Update testing of code blocks
jverre Jan 12, 2025
a3d7279
Update testing of code blocks
jverre Jan 12, 2025
e9ded5b
Update github actions
jverre Jan 12, 2025
1e251f7
Fix codeblocks
jverre Jan 12, 2025
b507c0d
Fix codeblocks
jverre Jan 12, 2025
05c2fbd
Fix codeblocks
jverre Jan 12, 2025
b985e5e
Fix codeblocks
jverre Jan 12, 2025
ff3399f
Update github actions
jverre Jan 12, 2025
fe154cd
Update github actions
jverre Jan 12, 2025
367fdba
Update github actions
jverre Jan 12, 2025
b99eb62
Fix codeblocks
jverre Jan 12, 2025
def794b
Updated following review
jverre Jan 13, 2025
8028c65
Updated following review
jverre Jan 13, 2025
131c151
Updated following review
jverre Jan 13, 2025
83c44bd
Move litellm opik monitoring logic to a separate module, add project …
alexkuzmik Jan 14, 2025
f383c09
Fix error_callback -> failure_callback
alexkuzmik Jan 14, 2025
5d217bd
Reorganize imports
alexkuzmik Jan 14, 2025
1b8d875
Make it possible to disable litellm tracking, dont track if litellm a…
alexkuzmik Jan 14, 2025
22ed1a2
Disable litellm monitoring via the callback in tests
alexkuzmik Jan 14, 2025
7a89c69
Merge branch 'main' into jacques/evaluate_prompt
alexkuzmik Jan 14, 2025
c279051
Explicitly disable litellm monitoring in every integration test workflow
alexkuzmik Jan 14, 2025
9dedf64
Fix lint errors
alexkuzmik Jan 14, 2025
21 changes: 8 additions & 13 deletions .github/workflows/documentation_codeblock_tests.yml
@@ -1,15 +1,6 @@
name: Documentation - Test codeblocks
on:
workflow_dispatch:
inputs:
install_opik:
description: 'Enable opik installation from source files'
required: false
default: 'false'
type: choice
options:
- 'false'
- 'true'
pull_request:
paths:
- 'apps/opik-documentation/documentation/docs/*.md'
@@ -63,6 +54,13 @@ jobs:
fail-fast: false
steps:
- uses: actions/checkout@v3
if: github.event_name == 'pull_request'
with:
ref: ${{ github.event.pull_request.head.sha }}
fetch-depth: 0

- uses: actions/checkout@v3
if: github.event_name != 'pull_request'

- name: Set up Python
uses: actions/setup-python@v4
@@ -75,13 +73,10 @@ jobs:
python -m pip install --upgrade pip
pip install pytest
pip install -r requirements.txt
if [ "${{ github.event.inputs.install_opik }}" = "true" ]; then
pip install -e .
fi

- name: Run tests
working-directory: apps/opik-documentation/documentation
run: |
if [ -n "${{ matrix.path }}" ]; then
pytest ${{ matrix.path }} -v --suppress-no-test-exit-code
pytest ${{ matrix.path }} -v --suppress-no-test-exit-code --default-package=../../../sdks/python
fi
1 change: 1 addition & 0 deletions .github/workflows/lib-integration-tests-runner.yml
@@ -34,6 +34,7 @@ on:
env:
SLACK_WEBHOOK_URL: ${{ secrets.ACTION_MONITORING_SLACK }}
LIBS: ${{ github.event.inputs.libs != '' && github.event.inputs.libs || 'all' }}
OPIK_DISABLE_LITELLM_MODELS_MONITORING: True

jobs:
init_environment:
2 changes: 2 additions & 0 deletions .github/workflows/python_sdk_unit_tests.yml
@@ -9,6 +9,8 @@ on:
- 'main'
paths:
- 'sdks/python/**'
env:
OPIK_DISABLE_LITELLM_MODELS_MONITORING: True
jobs:
UnitTests:
name: Units_Python_${{matrix.python_version}}
3 changes: 2 additions & 1 deletion .github/workflows/sdk-e2e-tests.yaml
@@ -12,7 +12,8 @@ on:
paths:
- 'sdks/python/**'
- 'apps/opik-backend/**'

env:
OPIK_DISABLE_LITELLM_MODELS_MONITORING: True
jobs:
run-e2e:
name: SDK E2E Tests ${{matrix.python_version}}
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -157,7 +157,7 @@ cd apps/opik-documentation/documentation
npm install

# Run the documentation website locally
npm run start
npm run dev
```

You can then access the documentation website at `http://localhost:3000`. Any change you make to the documentation will be updated in real-time.
4 changes: 2 additions & 2 deletions apps/opik-documentation/documentation/conftest.py
@@ -1,4 +1,4 @@
from pytest_codeblocks.pytest_integration import pytest_collect_file
from pytest_codeblocks.pytest_integration import pytest_collect_file, pytest_addoption

# Export the necessary components
__all__ = ["pytest_collect_file"]
__all__ = ["pytest_collect_file", "pytest_addoption"]
2 changes: 1 addition & 1 deletion apps/opik-documentation/documentation/docs/changelog.md
@@ -183,7 +183,7 @@ pytest_codeblocks_skip: true

**SDK**:

- Introduced the `Prompt` object in the SDK to manage prompts stored in the library. See the [Prompt Management](/library/managing_prompts_in_code.mdx) guide for more details.
- Introduced the `Prompt` object in the SDK to manage prompts stored in the library. See the [Prompt Management](/prompt_engineering/managing_prompts_in_code.mdx) guide for more details.
- Introduced a `Opik.search_spans` method to search for spans in a project. See the [Search spans](/tracing/export_data.md#exporting-spans) guide for more details.
- Released a new integration with [AWS Bedrock](/tracing/integrations/bedrock.md) for using Opik with Bedrock models.

@@ -6,7 +6,7 @@ description: Introduces the concepts behind Opik's evaluation framework
# Evaluation Concepts

:::tip
If you want to jump straight to running evaluations, you can head to the [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) section.
If you want to jump straight to running evaluations, you can head to the [Evaluate prompts](/docs/evaluation/evaluate_prompt.md) or [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) guides.
:::

When working with LLM applications, the bottleneck to iterating faster is often the evaluation process. While it is possible to manually review your LLM application's output, this process is slow and not scalable. Instead of manually reviewing your LLM application's output, Opik allows you to automate the evaluation of your LLM application.
@@ -63,27 +63,10 @@ Experiment items store the input, expected output, actual output and feedback scores

![Experiment Items](/img/evaluation/experiment_items.png)

## Running an evaluation
## Learn more

When you run an evaluation, you will need to know the following:
We have provided some guides to help you get started with Opik's evaluation framework:

1. Dataset: The dataset you want to run the evaluation on.
2. Evaluation task: This maps the inputs stored in the dataset to the output you would like to score. The evaluation task is typically the LLM application you are building.
3. Metrics: The metrics you would like to use when scoring the outputs of your LLM

You can then run the evaluation using the `evaluate` function:

```python
from opik import evaluate

evaluate(
    dataset=dataset,
    evaluation_task=evaluation_task,
    metrics=metrics,
    experiment_config={"prompt_template": "..."},
)
```

:::tip
You can find a full tutorial on defining evaluations in the [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md) section.
:::
1. [Overview of Opik's evaluation features](/docs/evaluation/overview.mdx)
2. [Evaluate prompts](/docs/evaluation/evaluate_prompt.md)
3. [Evaluate your LLM application](/docs/evaluation/evaluate_your_llm.md)
@@ -0,0 +1,124 @@
---
sidebar_label: Evaluate Prompts
description: Step by step guide on how to evaluate LLM prompts
---

# Evaluate Prompts

When developing prompts and performing prompt engineering, it can be challenging to know if a new prompt is better than the previous version.

Opik Experiments allow you to evaluate a prompt on multiple samples, score each LLM output, and compare the performance of different prompts.

![Experiment page](/img/evaluation/experiment_items.png)

There are two ways to evaluate a prompt in Opik:

1. Using the prompt playground
2. Using the `evaluate_prompt` function in the Python SDK

## Using the prompt playground

The Opik playground allows you to quickly test different prompts and see how they perform.

You can compare multiple prompts by clicking the `+ Add prompt` button in the top right corner of the playground. This allows you to enter several prompts and view them side by side.

In order to evaluate the prompts on samples, you can add variables to the prompt messages using the `{{variable}}` syntax. You can then connect a dataset and run the prompts on each dataset item.

![Playground evaluation](/img/evaluation/playground_evaluation.gif)
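As a purely illustrative sketch (the `question` variable, the dataset item and the rendering logic below are hypothetical examples, not part of the Opik SDK), this is how a templated message lines up with a dataset item:

```python
# A playground-style message template using the mustache {{variable}} syntax.
prompt_template = {
    "role": "user",
    "content": "Answer the following question: {{question}}",
}

# Each dataset item supplies a value for every variable used in the template.
dataset_item = {"question": "What is the capital of France?"}

# Conceptually, the playground renders one prompt per dataset item:
rendered = prompt_template["content"].replace("{{question}}", dataset_item["question"])
print(rendered)  # Answer the following question: What is the capital of France?
```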

## Using the Python SDK

The Python SDK provides a simple way to evaluate prompts using the `evaluate_prompt` function. This method allows you to specify a dataset, a prompt and a model. The prompt is evaluated on each dataset item, and the outputs can then be reviewed and annotated in the Opik UI.

To run the experiment, you can use the following code:

```python
import opik
from opik.evaluation import evaluate_prompt

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model="gpt-3.5-turbo",
)
```

Once the evaluation is complete, you can view the responses in the Opik UI and score each LLM output.

![Experiment page](/img/evaluation/experiment_items.png)

### Automate the scoring process

Manually reviewing each LLM output can be time-consuming and error-prone. The `evaluate_prompt` function accepts a list of scoring metrics that are used to score each LLM output. Opik has a set of built-in metrics that allow you to detect hallucinations, measure answer relevance, and more; if we don't have the metric you need, you can easily create your own.

You can find a full list of all the Opik supported metrics in the [Metrics Overview](/evaluation/metrics/overview.md) section or you can define your own metric using [Custom Metrics](/evaluation/metrics/custom_metric.md).
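For reference, a minimal custom metric could be sketched as follows. This mirrors the `BaseMetric` / `ScoreResult` pattern described in the Custom Metrics guide; treat the class and scoring logic as an illustrative assumption and check that guide for the authoritative interface:

```python
from opik.evaluation.metrics import base_metric, score_result


class ContainsParis(base_metric.BaseMetric):
    """Toy metric that checks whether the model output mentions 'Paris'."""

    def __init__(self, name: str = "contains_paris"):
        self.name = name

    def score(self, output: str, **ignored_kwargs):
        # Return a score between 0 and 1 for a single LLM output.
        return score_result.ScoreResult(
            value=1.0 if "Paris" in output else 0.0,
            name=self.name,
        )
```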

By adding the `scoring_metrics` parameter to the `evaluate_prompt` function, you can specify a list of metrics to use for scoring. We will update the example above to use the `Hallucination` metric:

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model="gpt-3.5-turbo",
    scoring_metrics=[Hallucination()],
)
```

### Customizing the model used

You can customize the model used by creating a new model with the [`LiteLLMChatModel`](https://www.comet.com/docs/opik/python-sdk-reference/Objects/LiteLLMChatModel.html) class. This supports passing additional parameters such as the `temperature` or the base URL to use for the model.

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination
from opik.evaluation.models import litellm_chat_model

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Translate the following text to French: {{input}}"},
    ],
    model=litellm_chat_model.LiteLLMChatModel(model="gpt-3.5-turbo", temperature=0),
    scoring_metrics=[Hallucination()],
)
```

## Next steps

To evaluate complex LLM applications like RAG applications or agents, you can use the [`evaluate`](/evaluation/evaluate_your_llm.md) function.
@@ -1,12 +1,16 @@
---
sidebar_label: Evaluate your LLM Application
sidebar_label: Evaluate Complex LLM Applications
description: Step by step guide on how to evaluate your LLM application
pytest_codeblocks_execute_previous: true
---

# Evaluate your LLM Application
# Evaluate Complex LLM Applications

Evaluating your LLM application allows you to have confidence in the performance of your LLM application. This evaluation set is often performed both during the development and as part of the testing of an application.
Evaluating your LLM application gives you confidence in its performance. In this guide, we will walk through the process of evaluating complex applications like LLM chains or agents.

:::tip
In this guide, we will focus on evaluating complex LLM applications; if you are looking to evaluate single prompts, you can refer to the [Evaluate a prompt](/evaluation/evaluate_prompt.md) guide.
:::

The evaluation is done in five steps:

@@ -178,7 +182,7 @@ evaluation = evaluate(

### Linking prompts to experiments

The [Opik prompt library](/library/prompt_management.mdx) can be used to version your prompt templates.
The [Opik prompt library](/prompt_engineering/prompt_management.mdx) can be used to version your prompt templates.

When creating an Experiment, you can link the Experiment to a specific prompt version:
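The full example is collapsed in this diff view. As a rough sketch only (the `create_prompt` method, the `prompt` argument of `evaluate`, and the names used below are assumptions based on the SDK reference, so verify them against the complete guide):

```python
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination

opik_client = opik.Opik()

# Register a versioned prompt in the Opik prompt library (illustrative name and template).
prompt = opik_client.create_prompt(
    name="translation-prompt",
    prompt="Translate the following text to French: {{input}}",
)

dataset = opik_client.get_or_create_dataset("my_dataset")

def evaluation_task(dataset_item):
    # Replace this with a call to your LLM application.
    return {"output": dataset_item["input"]}

# Passing the prompt object links the resulting experiment to this prompt version.
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],
    prompt=prompt,
)
```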

@@ -238,7 +242,7 @@ In order to evaluate datasets more efficiently, Opik uses multiple background threads

You can access all the experiments logged to the platform from the SDK with the [`Opik.get_experiments_by_name`](https://www.comet.com/docs/opik/python-sdk-reference/Opik.html#opik.Opik.get_experiment_by_name) and [`Opik.get_experiment_by_id`](https://www.comet.com/docs/opik/python-sdk-reference/Opik.html#opik.Opik.get_experiment_by_id) methods:

```python
```python pytest_codeblocks_skip=true
import opik

# Get the experiment
@@ -14,27 +14,27 @@ Heuristic metrics are deterministic and are often statistical in nature. LLM as

Opik provides the following built-in evaluation metrics:

| Metric | Type | Description | Documentation |
| ---------------- | -------------- | ------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| Equals | Heuristic | Checks if the output exactly matches an expected string | [Equals](/evaluation/metrics/heuristic_metrics#equals) |
| Contains | Heuristic | Check if the output contains a specific substring, can be both case sensitive or case insensitive | [Contains](/evaluation/metrics/heuristic_metrics#contains) |
| RegexMatch | Heuristic | Checks if the output matches a specified regular expression pattern | [RegexMatch](/evaluation/metrics/heuristic_metrics#regexmatch) |
| IsJson | Heuristic | Checks if the output is a valid JSON object | [IsJson](/evaluation/metrics/heuristic_metrics#isjson) |
| Levenshtein | Heuristic | Calculates the Levenshtein distance between the output and an expected string | [Levenshtein](/evaluation/metrics/heuristic_metrics#levenshteinratio) |
| Hallucination | LLM as a Judge | Check if the output contains any hallucinations | [Hallucination](/evaluation/metrics/hallucination) |
| G-Eval | LLM as a Judge | Task agnostic LLM as a Judge metric | [G-Eval](/evaluation/metrics/g_eval) |
| Moderation | LLM as a Judge | Check if the output contains any harmful content | [Moderation](/evaluation/metrics/moderation) |
| AnswerRelevance | LLM as a Judge | Check if the output is relevant to the question | [AnswerRelevance](/evaluation/metrics/answer_relevance) |
| ContextRecall | LLM as a Judge | Check if the output contains any hallucinations | [ContextRecall](/evaluation/metrics/context_recall) |
| ContextPrecision | LLM as a Judge | Check if the output contains any hallucinations | [ContextPrecision](/evaluation/metrics/context_precision) |

You can also create your own custom metric, learn more about it in the [Custom Metric](/evaluation/metrics/custom_metric) section.
| Metric | Type | Description | Documentation |
| ---------------- | -------------- | ------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| Equals | Heuristic | Checks if the output exactly matches an expected string | [Equals](/evaluation/metrics/heuristic_metrics.md#equals) |
| Contains | Heuristic | Check if the output contains a specific substring, can be both case sensitive or case insensitive | [Contains](/evaluation/metrics/heuristic_metrics.md#contains) |
| RegexMatch | Heuristic | Checks if the output matches a specified regular expression pattern | [RegexMatch](/evaluation/metrics/heuristic_metrics.md#regexmatch) |
| IsJson | Heuristic | Checks if the output is a valid JSON object | [IsJson](/evaluation/metrics/heuristic_metrics.md#isjson) |
| Levenshtein | Heuristic | Calculates the Levenshtein distance between the output and an expected string | [Levenshtein](/evaluation/metrics/heuristic_metrics.md#levenshteinratio) |
| Hallucination | LLM as a Judge | Check if the output contains any hallucinations | [Hallucination](/evaluation/metrics/hallucination.md) |
| G-Eval | LLM as a Judge | Task agnostic LLM as a Judge metric | [G-Eval](/evaluation/metrics/g_eval.md) |
| Moderation | LLM as a Judge | Check if the output contains any harmful content | [Moderation](/evaluation/metrics/moderation.md) |
| AnswerRelevance | LLM as a Judge | Check if the output is relevant to the question | [AnswerRelevance](/evaluation/metrics/answer_relevance.md) |
| ContextRecall | LLM as a Judge | Check if the output contains any hallucinations | [ContextRecall](/evaluation/metrics/context_recall.md) |
| ContextPrecision | LLM as a Judge | Check if the output contains any hallucinations | [ContextPrecision](/evaluation/metrics/context_precision.md) |

You can also create your own custom metric, learn more about it in the [Custom Metric](/evaluation/metrics/custom_metric.md) section.

## Customizing LLM as a Judge metrics

By default, Opik uses GPT-4o from OpenAI as the LLM to evaluate the output of other LLMs. However, you can easily switch to another LLM provider by specifying a different model in the `model` parameter of each LLM as a Judge metric.

```python
```python pytest_codeblocks_skip=true
from opik.evaluation.metrics import Hallucination

metric = Hallucination(model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")