Add support for DeepSeek models (#24)

* Add support for DeepSeek models
* Add evaluation results for DeepSeek V3
gradion-ai · Jan 19, 2025 · 90020b4
1 parent 788fdbc
commit 90020b4

Showing 13 changed files with 112 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -100,12 +100,13 @@ https://github.com/user-attachments/assets/83cec179-54dc-456c-b647-ea98ec99600b
 
 ## Evaluation
 
-We [evaluated](evaluation) `freeact` using four state-of-the-art models:
+We [evaluated](evaluation) `freeact` using five state-of-the-art models:
 
 - Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
 - Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
 - Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
 - Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
+- DeepSeek V3 (`deepseek-v3`)
 
 The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:
 

diff --git a/docs/api/deepseek.md b/docs/api/deepseek.md
@@ -0,0 +1,5 @@
+::: freeact.model.deepseek.model
+    options:
+      show_root_heading: false
+      members:
+      - DeepSeek
diff --git a/docs/eval/eval-plot.png b/docs/eval/eval-plot.png
diff --git a/docs/evaluation.md b/docs/evaluation.md
@@ -6,6 +6,7 @@ We [evaluated](https://github.com/gradion-ai/freeact/tree/main/evaluation) `free
 - Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
 - Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
 - Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
+- DeepSeek V3 (`deepseek-v3`)
 
 The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:
 

diff --git a/docs/models.md b/docs/models.md
@@ -10,12 +10,13 @@ The following models have been [evaluated](evaluation.md) with `freeact`:
 - Claude 3.5 Haiku (20241022)
 - Gemini 2.0 Flash (experimental)
 - Qwen 2.5 Coder 32B Instruct
+- DeepSeek V3
 
 For these models, `freeact` provides model-specific prompt templates.
 
 !!! Tip
 
-    For best performance, we recommend using Claude 3.5 Sonnet. Support for Gemini 2.0 Flash and Qwen 2.5 Coder is still experimental. The Qwen 2.5 Coder integration is described in [Model integration](#model-integration).
+    For best performance, we recommend using Claude 3.5 Sonnet. Support for Gemini 2.0 Flash, Qwen 2.5 Coder and DeepSeek V3 is still experimental. The Qwen 2.5 Coder integration is described in [Model integration](#model-integration). The DeepSeek V3 integration follows the same pattern using a custom model class.
 
 ## Model integration
 
@@ -46,7 +47,7 @@ Start with model-specific prompt templates that guide Qwen 2.5 Coder Instruct mo
 
 !!! Tip
 
-    While tested with Qwen 2.5 Coder Instruct, these prompt templates may also serve as a good starting point for other models.
+    While tested with Qwen 2.5 Coder Instruct, these prompt templates may also serve as a good starting point for other models (e.g. DeepSeek V3, which uses the same prompt templates).
 
 #### Model definition
 

diff --git a/evaluation/README.md b/evaluation/README.md
@@ -1,11 +1,12 @@
 # Evaluation
 
-We evaluated `freeact` using four state-of-the-art models:
+We evaluated `freeact` using five state-of-the-art models:
 
 - Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
 - Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
 - Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
 - Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
+- DeepSeek V3 (`deepseek-v3`)
 
 The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:
 
@@ -29,6 +30,10 @@ The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://hu
 | qwen2p5-coder-32b-instruct | GSM8K    | exact_match     |  **95.7** |
 | qwen2p5-coder-32b-instruct | SimpleQA | exact_match     |      52.5 |
 | qwen2p5-coder-32b-instruct | SimpleQA | llm_as_judge    |      65.0 |
+| deepseek-v3                | GAIA     | exact_match     |      37.5 |
+| deepseek-v3                | GSM8K    | exact_match     |      91.4 |
+| deepseek-v3                | SimpleQA | exact_match     |      60.0 |
+| deepseek-v3                | SimpleQA | llm_as_judge    |      67.5 |
 
 When comparing our results with smolagents using `claude-3-5-sonnet-20241022`, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
 
@@ -68,7 +73,7 @@ ANTHROPIC_API_KEY=...
 # Gemini 2 Flash Experimental
 GOOGLE_API_KEY=...
 
-# Qwen 2.5 Coder 32B Instruct
+# Qwen 2.5 Coder 32B Instruct and DeepSeek V3
 FIREWORKS_API_KEY=...
 
 # Google Web Search
@@ -96,6 +101,10 @@ python evaluation/evaluate.py \
 python evaluation/evaluate.py \
     --model-name qwen2p5-coder-32b-instruct \
     --run-id qwen2p5-coder-32b-instruct
+
+python evaluation/evaluate.py \
+    --model-name deepseek-v3 \
+    --run-id deepseek-v3
 ```
 
 Results are saved in `output/evaluation/<run-id>`. Pre-generated outputs from our runs are available [here](https://github.com/user-attachments/files/18433107/evaluation-results-agents-2_medium_benchmark_2.zip).
@@ -109,7 +118,8 @@ python evaluation/score.py \
   --evaluation-dir output/evaluation/claude-3-5-sonnet-20241022 \
   --evaluation-dir output/evaluation/claude-3-5-haiku-20241022 \
   --evaluation-dir output/evaluation/gemini-2.0-flash-exp \
-  --evaluation-dir output/evaluation/qwen2p5-coder-32b-instruct
+  --evaluation-dir output/evaluation/qwen2p5-coder-32b-instruct \
+  --evaluation-dir output/evaluation/deepseek-v3
 ```
 
 Generate visualization and reports:

diff --git a/evaluation/evaluate.py b/evaluation/evaluate.py
@@ -20,6 +20,7 @@
     CodeActModel,
     CodeActModelTurn,
     CodeExecution,
+    DeepSeek,
     Gemini,
     QwenCoder,
     execution_environment,
@@ -241,6 +242,13 @@ async def run_agent(
                 model_name=f"accounts/fireworks/models/{model_name}",
                 skill_sources=skill_sources,
             )
+        elif model_name == "deepseek-v3":
+            model = DeepSeek(
+                api_key=os.getenv("FIREWORKS_API_KEY"),
+                base_url="https://api.fireworks.ai/inference/v1",
+                model_name=f"accounts/fireworks/models/{model_name}",
+                skill_sources=skill_sources,
+            )
         else:
             raise ValueError(f"Unknown model: {model_name}")
 

diff --git a/evaluation/report.py b/evaluation/report.py
@@ -1,5 +1,5 @@
 from pathlib import Path
-from typing import Annotated
+from typing import Annotated, Literal
 
 import matplotlib.pyplot as plt
 import pandas as pd
@@ -42,6 +42,7 @@ def performance(
         hue_order=["GAIA (exact_match)", "GSM8K (exact_match)", "SimpleQA (exact_match)", "SimpleQA (llm_as_judge)"],
         title=f"freeact performance on {benchmark_display_name}",
         output_file=output_dir / "eval-plot.png",
+        legend_location="top",
     )
 
     print("Results:")
@@ -63,6 +64,7 @@ def create_barplot(
     hue_order: list[str],
     title: str,
     output_file: Path,
+    legend_location: Literal["top", "right"] = "right",
 ):
     sns.set_style("whitegrid")
     plt.figure(figsize=figsize)
@@ -82,8 +84,13 @@ def create_barplot(
     ax.set_ylabel("% Correct")
     ax.spines["top"].set_visible(False)
 
-    plt.title(title)
-    plt.legend(fontsize=10, bbox_to_anchor=(1.05, 0.5), loc="center left")
+    if legend_location == "top":
+        plt.title(title, pad=50)
+        plt.legend(fontsize=10, bbox_to_anchor=(0.5, 1.05), loc="center", ncol=2)
+    else:
+        plt.title(title)
+        plt.legend(fontsize=10, bbox_to_anchor=(1.05, 0.5), loc="center left")
+
     plt.xticks(rotation=0, fontsize=8)
 
     if not output_file.parent.exists():

diff --git a/freeact/__init__.py b/freeact/__init__.py
@@ -6,6 +6,7 @@
     CodeActModel,
     CodeActModelResponse,
     CodeActModelTurn,
+    DeepSeek,
     Gemini,
     GeminiLive,
     GeminiModelName,

diff --git a/freeact/cli/__main__.py b/freeact/cli/__main__.py
@@ -6,7 +6,7 @@
 from dotenv import load_dotenv
 from rich.console import Console
 
-from freeact import Claude, CodeActAgent, CodeActModel, Gemini, QwenCoder, execution_environment
+from freeact import Claude, CodeActAgent, CodeActModel, DeepSeek, Gemini, QwenCoder, execution_environment
 from freeact.cli.utils import read_file, stream_conversation
 
 app = typer.Typer()
@@ -80,6 +80,17 @@ async def amain(
                 "temperature": temperature,
                 "max_tokens": max_tokens,
             }
+        elif "deepseek" in model_name.lower():
+            model = DeepSeek(
+                model_name=model_name,
+                skill_sources=skill_sources,
+                api_key=api_key,
+                base_url=base_url,
+            )
+            run_kwargs |= {
+                "temperature": temperature,
+                "max_tokens": max_tokens,
+            }
         else:
             typer.echo(f"Unsupported model: {model_name}", err=True)
             raise typer.Exit(code=1)
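
A note on the two dispatch conditions in this commit: `evaluation/evaluate.py` matches the short name `deepseek-v3` exactly and then expands it to a Fireworks model path, while the CLI uses a case-insensitive substring check. The sketch below mirrors only those two expressions from the diff. The helper names `is_deepseek` and `fireworks_model_name` are hypothetical and not part of `freeact`.

```python
def is_deepseek(model_name: str) -> bool:
    # Mirrors the CLI condition: case-insensitive substring match.
    return "deepseek" in model_name.lower()


def fireworks_model_name(model_name: str) -> str:
    # Mirrors the f-string used in evaluation/evaluate.py.
    return f"accounts/fireworks/models/{model_name}"


# The expanded Fireworks path still satisfies the CLI check.
full_name = fireworks_model_name("deepseek-v3")
short_name_matches = is_deepseek("deepseek-v3")
full_name_matches = is_deepseek(full_name)
mixed_case_matches = is_deepseek("DeepSeek-V3")
qwen_matches = is_deepseek("qwen2p5-coder-32b-instruct")

print(full_name)
print(short_name_matches, full_name_matches, mixed_case_matches, qwen_matches)
```

Because the CLI check is a substring match, both the short name and a provider-prefixed name select the `DeepSeek` branch, whereas the evaluation script accepts only the short name.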

diff --git a/freeact/model/__init__.py b/freeact/model/__init__.py
@@ -1,5 +1,6 @@
 from freeact.model.base import CodeActModel, CodeActModelResponse, CodeActModelTurn
 from freeact.model.claude.model import Claude, ClaudeModelName
+from freeact.model.deepseek.model import DeepSeek
 from freeact.model.gemini.model.chat import Gemini, GeminiResponse
 from freeact.model.gemini.model.live import GeminiLive, GeminiModelName
 from freeact.model.generic.model import GenericModel

diff --git a/freeact/model/deepseek/__init__.py b/freeact/model/deepseek/__init__.py
diff --git a/freeact/model/deepseek/model.py b/freeact/model/deepseek/model.py
@@ -0,0 +1,56 @@
+import os
+from typing import Any, Dict
+
+from freeact.model.generic.model import GenericModel
+from freeact.model.qwen.prompt import (
+    EXECUTION_ERROR_TEMPLATE,
+    EXECUTION_OUTPUT_TEMPLATE,
+    SYSTEM_TEMPLATE,
+)
+
+
+class DeepSeek(GenericModel):
+    """A specialized implementation of `GenericModel` for DeepSeek's models.
+
+    This class configures `GenericModel` specifically for use with DeepSeek V3 models
+    and uses the same prompt templates as Qwen 2.5 Coder.
+    It has been tested with *DeepSeek V3*. Smaller models
+    in this series may require adjustments to the prompt templates.
+
+    Args:
+        model_name: The provider-specific name of the DeepSeek model to use.
+        api_key: Optional API key for DeepSeek. If not provided, it is read from the `DEEPSEEK_API_KEY` environment variable.
+        base_url: Optional base URL for the API. If not provided, it is read from the `DEEPSEEK_BASE_URL` environment variable.
+        skill_sources: Optional string containing Python skill module information to include in system template.
+        system_template: Prompt template for the system message that guides the model to generate code actions.
+            Must define a `{python_modules}` placeholder for the skill sources.
+        execution_output_template: Prompt template for formatting execution outputs.
+            Must define an `{execution_feedback}` placeholder.
+        execution_error_template: Prompt template for formatting execution errors.
+            Must define an `{execution_feedback}` placeholder.
+        run_kwargs: Defines the stopping conditions for the model.
+        **kwargs: Additional keyword arguments passed to the `GenericModel` constructor.
+    """
+
+    def __init__(
+        self,
+        model_name: str,
+        api_key: str | None = None,
+        base_url: str | None = None,
+        skill_sources: str | None = None,
+        system_template: str = SYSTEM_TEMPLATE,
+        execution_output_template: str = EXECUTION_OUTPUT_TEMPLATE,
+        execution_error_template: str = EXECUTION_ERROR_TEMPLATE,
+        run_kwargs: Dict[str, Any] | None = None,
+        **kwargs,
+    ):
+        super().__init__(
+            model_name=model_name,
+            api_key=api_key or os.getenv("DEEPSEEK_API_KEY"),
+            base_url=base_url or os.getenv("DEEPSEEK_BASE_URL"),
+            system_message=system_template.format(python_modules=skill_sources or ""),
+            execution_output_template=execution_output_template,
+            execution_error_template=execution_error_template,
+            run_kwargs=run_kwargs,
+            **kwargs,
+        )
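
The constructor above resolves `api_key` and `base_url` with the expression `value or os.getenv(...)`. A minimal sketch of that fallback follows, assuming nothing beyond the standard library. The helper `resolve_setting` and the key values are hypothetical and used only for illustration.

```python
from __future__ import annotations

import os


def resolve_setting(explicit: str | None, env_var: str) -> str | None:
    # Same expression as in DeepSeek.__init__: any falsy explicit value
    # (None or an empty string) falls back to the environment variable.
    return explicit or os.getenv(env_var)


# Placeholder values for the demonstration only.
os.environ["DEEPSEEK_API_KEY"] = "key-from-env"
os.environ.pop("DEEPSEEK_BASE_URL", None)

from_argument = resolve_setting("key-from-argument", "DEEPSEEK_API_KEY")
from_env = resolve_setting(None, "DEEPSEEK_API_KEY")
from_empty = resolve_setting("", "DEEPSEEK_API_KEY")
missing = resolve_setting(None, "DEEPSEEK_BASE_URL")

print(from_argument, from_env, from_empty, missing)
```

In `evaluation/evaluate.py` the key is passed explicitly from `FIREWORKS_API_KEY`. If that variable is unset, `os.getenv` returns `None` and the constructor falls back to `DEEPSEEK_API_KEY`.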