Commit

Add support for DeepSeek models (#24)
* Add support for DeepSeek models
* Add evaluation results for DeepSeek V3
cstub authored Jan 19, 2025
1 parent 788fdbc commit 90020b4
Showing 13 changed files with 112 additions and 10 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -100,12 +100,13 @@ https://github.com/user-attachments/assets/83cec179-54dc-456c-b647-ea98ec99600b

## Evaluation

We [evaluated](evaluation) `freeact` using four state-of-the-art models:
We [evaluated](evaluation) `freeact` using five state-of-the-art models:

- Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
- Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)

The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:

5 changes: 5 additions & 0 deletions docs/api/deepseek.md
@@ -0,0 +1,5 @@
::: freeact.model.deepseek.model
    options:
      show_root_heading: false
      members:
      - DeepSeek
Binary file modified docs/eval/eval-plot.png
1 change: 1 addition & 0 deletions docs/evaluation.md
@@ -6,6 +6,7 @@ We [evaluated](https://github.com/gradion-ai/freeact/tree/main/evaluation) `free
- Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
- Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)

The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:

5 changes: 3 additions & 2 deletions docs/models.md
@@ -10,12 +10,13 @@ The following models have been [evaluated](evaluation.md) with `freeact`:
- Claude 3.5 Haiku (20241022)
- Gemini 2.0 Flash (experimental)
- Qwen 2.5 Coder 32B Instruct
- DeepSeek V3

For these models, `freeact` provides model-specific prompt templates.

!!! Tip

For best performance, we recommend using Claude 3.5 Sonnet. Support for Gemini 2.0 Flash and Qwen 2.5 Coder is still experimental. The Qwen 2.5 Coder integration is described in [Model integration](#model-integration).
For best performance, we recommend using Claude 3.5 Sonnet. Support for Gemini 2.0 Flash, Qwen 2.5 Coder and DeepSeek V3 is still experimental. The Qwen 2.5 Coder integration is described in [Model integration](#model-integration). The DeepSeek V3 integration follows the same pattern using a custom model class.

## Model integration

@@ -46,7 +47,7 @@ Start with model-specific prompt templates that guide Qwen 2.5 Coder Instruct mo

!!! Tip

While tested with Qwen 2.5 Coder Instruct, these prompt templates may also serve as a good starting point for other models.
While tested with Qwen 2.5 Coder Instruct, these prompt templates may also serve as a good starting point for other models (e.g. DeepSeek V3 which uses the same prompt templates).

#### Model definition

16 changes: 13 additions & 3 deletions evaluation/README.md
@@ -1,11 +1,12 @@
# Evaluation

We evaluated `freeact` using four state-of-the-art models:
We evaluated `freeact` using five state-of-the-art models:

- Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
- Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)

The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:

@@ -29,6 +30,10 @@ The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://hu
| qwen2p5-coder-32b-instruct | GSM8K | exact_match | **95.7** |
| qwen2p5-coder-32b-instruct | SimpleQA | exact_match | 52.5 |
| qwen2p5-coder-32b-instruct | SimpleQA | llm_as_judge | 65.0 |
| deepseek-v3 | GAIA | exact_match | 37.5 |
| deepseek-v3 | GSM8K | exact_match | 91.4 |
| deepseek-v3 | SimpleQA | exact_match | 60.0 |
| deepseek-v3 | SimpleQA | llm_as_judge | 67.5 |
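
To put the new DeepSeek V3 rows in context, the sketch below compares them with the Qwen 2.5 Coder rows visible in this hunk. The numbers are copied from the table rows above; the `SCORES` layout and the `score_delta` helper are illustrative assumptions and are not part of `evaluation/score.py`.

```python
# Scores (% correct) copied from the results table rows shown in this hunk.
# Qwen's GAIA row lies outside the hunk, so it is intentionally omitted.
SCORES = {
    "qwen2p5-coder-32b-instruct": {
        ("GSM8K", "exact_match"): 95.7,
        ("SimpleQA", "exact_match"): 52.5,
        ("SimpleQA", "llm_as_judge"): 65.0,
    },
    "deepseek-v3": {
        ("GAIA", "exact_match"): 37.5,
        ("GSM8K", "exact_match"): 91.4,
        ("SimpleQA", "exact_match"): 60.0,
        ("SimpleQA", "llm_as_judge"): 67.5,
    },
}


def score_delta(model_a: str, model_b: str) -> dict:
    """Return model_a minus model_b for every (subset, metric) both models report."""
    a, b = SCORES[model_a], SCORES[model_b]
    # Round to one decimal to match the precision of the table.
    return {key: round(a[key] - b[key], 1) for key in a if key in b}


if __name__ == "__main__":
    deltas = score_delta("deepseek-v3", "qwen2p5-coder-32b-instruct")
    for (subset, metric), delta in deltas.items():
        print(f"{subset} ({metric}): {delta:+.1f}")
```

Only rows present for both models are compared, so the GAIA score of DeepSeek V3 does not appear in the result.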

When comparing our results with smolagents using `claude-3-5-sonnet-20241022`, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):

@@ -68,7 +73,7 @@ ANTHROPIC_API_KEY=...
# Gemini 2 Flash Experimental
GOOGLE_API_KEY=...
# Qwen 2.5 Coder 32B Instruct
# Qwen 2.5 Coder 32B Instruct and DeepSeek V3
FIREWORKS_API_KEY=...
# Google Web Search
@@ -96,6 +101,10 @@ python evaluation/evaluate.py \
python evaluation/evaluate.py \
--model-name qwen2p5-coder-32b-instruct \
--run-id qwen2p5-coder-32b-instruct

python evaluation/evaluate.py \
--model-name deepseek-v3 \
--run-id deepseek-v3
```

Results are saved in `output/evaluation/<run-id>`. Pre-generated outputs from our runs are available [here](https://github.com/user-attachments/files/18433107/evaluation-results-agents-2_medium_benchmark_2.zip).
@@ -109,7 +118,8 @@ python evaluation/score.py \
--evaluation-dir output/evaluation/claude-3-5-sonnet-20241022 \
--evaluation-dir output/evaluation/claude-3-5-haiku-20241022 \
--evaluation-dir output/evaluation/gemini-2.0-flash-exp \
--evaluation-dir output/evaluation/qwen2p5-coder-32b-instruct
--evaluation-dir output/evaluation/qwen2p5-coder-32b-instruct \
--evaluation-dir output/evaluation/deepseek-v3
```

Generate visualization and reports:
8 changes: 8 additions & 0 deletions evaluation/evaluate.py
@@ -20,6 +20,7 @@
    CodeActModel,
    CodeActModelTurn,
    CodeExecution,
    DeepSeek,
    Gemini,
    QwenCoder,
    execution_environment,
@@ -241,6 +242,13 @@ async def run_agent(
            model_name=f"accounts/fireworks/models/{model_name}",
            skill_sources=skill_sources,
        )
    elif model_name == "deepseek-v3":
        model = DeepSeek(
            api_key=os.getenv("FIREWORKS_API_KEY"),
            base_url="https://api.fireworks.ai/inference/v1",
            model_name=f"accounts/fireworks/models/{model_name}",
            skill_sources=skill_sources,
        )
    else:
        raise ValueError(f"Unknown model: {model_name}")
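
The `deepseek-v3` branch above passes the Fireworks API key, base URL, and a provider-prefixed model name to `DeepSeek`. A minimal sketch of how those arguments are assembled, assuming a hypothetical helper `fireworks_kwargs` (not defined in `evaluate.py`) and an injectable environment mapping so the logic can be exercised without real credentials:

```python
import os
from typing import Mapping, Optional

# Base URL taken from the hunk above.
FIREWORKS_BASE_URL = "https://api.fireworks.ai/inference/v1"


def fireworks_kwargs(
    model_name: str,
    skill_sources: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build keyword arguments for a Fireworks-hosted model.

    Hypothetical helper: it mirrors the arguments passed to `DeepSeek(...)`
    in the hunk above but does not exist in the repository.
    """
    env = os.environ if env is None else env
    return {
        # None if the variable is unset, matching os.getenv's default.
        "api_key": env.get("FIREWORKS_API_KEY"),
        "base_url": FIREWORKS_BASE_URL,
        # Short evaluation names are expanded to Fireworks model paths.
        "model_name": f"accounts/fireworks/models/{model_name}",
        "skill_sources": skill_sources,
    }


if __name__ == "__main__":
    kwargs = fireworks_kwargs("deepseek-v3", env={"FIREWORKS_API_KEY": "dummy"})
    print(kwargs["model_name"])
```

With `freeact` installed, the resulting dictionary could be unpacked as `DeepSeek(**fireworks_kwargs("deepseek-v3"))`.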

13 changes: 10 additions & 3 deletions evaluation/report.py
@@ -1,5 +1,5 @@
from pathlib import Path
from typing import Annotated
from typing import Annotated, Literal

import matplotlib.pyplot as plt
import pandas as pd
@@ -42,6 +42,7 @@ def performance(
        hue_order=["GAIA (exact_match)", "GSM8K (exact_match)", "SimpleQA (exact_match)", "SimpleQA (llm_as_judge)"],
        title=f"freeact performance on {benchmark_display_name}",
        output_file=output_dir / "eval-plot.png",
        legend_location="top",
    )

    print("Results:")
@@ -63,6 +64,7 @@ def create_barplot(
    hue_order: list[str],
    title: str,
    output_file: Path,
    legend_location: Literal["top", "right"] = "right",
):
    sns.set_style("whitegrid")
    plt.figure(figsize=figsize)
@@ -82,8 +84,13 @@ def create_barplot(
    ax.set_ylabel("% Correct")
    ax.spines["top"].set_visible(False)

    plt.title(title)
    plt.legend(fontsize=10, bbox_to_anchor=(1.05, 0.5), loc="center left")
    if legend_location == "top":
        plt.title(title, pad=50)
        plt.legend(fontsize=10, bbox_to_anchor=(0.5, 1.05), loc="center", ncol=2)
    else:
        plt.title(title)
        plt.legend(fontsize=10, bbox_to_anchor=(1.05, 0.5), loc="center left")

    plt.xticks(rotation=0, fontsize=8)

    if not output_file.parent.exists():
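
The new `legend_location` parameter switches between two sets of legend and title arguments. The sketch below isolates that choice in plain functions so it can be inspected without rendering a plot; `legend_kwargs` and `title_kwargs` are hypothetical names, not functions in `report.py`, and the values are copied from the hunk above.

```python
from typing import Any, Dict, Literal

LegendLocation = Literal["top", "right"]


def legend_kwargs(legend_location: LegendLocation = "right") -> Dict[str, Any]:
    """Return keyword arguments for `plt.legend` for the given placement."""
    if legend_location == "top":
        # Centered just above the axes, entries laid out in two columns.
        return {"fontsize": 10, "bbox_to_anchor": (0.5, 1.05), "loc": "center", "ncol": 2}
    # Default: just right of the axes, vertically centered.
    return {"fontsize": 10, "bbox_to_anchor": (1.05, 0.5), "loc": "center left"}


def title_kwargs(legend_location: LegendLocation = "right") -> Dict[str, Any]:
    """Return extra keyword arguments for `plt.title`."""
    # Extra padding moves the title up, leaving room for a legend above the axes.
    return {"pad": 50} if legend_location == "top" else {}


if __name__ == "__main__":
    print(legend_kwargs("top"))
    print(title_kwargs("top"))
```

With matplotlib available, these would be applied as `plt.title(title, **title_kwargs("top"))` and `plt.legend(**legend_kwargs("top"))`.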
1 change: 1 addition & 0 deletions freeact/__init__.py
@@ -6,6 +6,7 @@
    CodeActModel,
    CodeActModelResponse,
    CodeActModelTurn,
    DeepSeek,
    Gemini,
    GeminiLive,
    GeminiModelName,
Expand Down
13 changes: 12 additions & 1 deletion freeact/cli/__main__.py
@@ -6,7 +6,7 @@
from dotenv import load_dotenv
from rich.console import Console

from freeact import Claude, CodeActAgent, CodeActModel, Gemini, QwenCoder, execution_environment
from freeact import Claude, CodeActAgent, CodeActModel, DeepSeek, Gemini, QwenCoder, execution_environment
from freeact.cli.utils import read_file, stream_conversation

app = typer.Typer()
@@ -80,6 +80,17 @@ async def amain(
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
    elif "deepseek" in model_name.lower():
        model = DeepSeek(
            model_name=model_name,
            skill_sources=skill_sources,
            api_key=api_key,
            base_url=base_url,
        )
        run_kwargs |= {
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
    else:
        typer.echo(f"Unsupported model: {model_name}", err=True)
        raise typer.Exit(code=1)
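
The new CLI branch relies on two small Python idioms: a case-insensitive substring test on the model name, and the in-place dict union operator `|=` (available since Python 3.9). A minimal sketch of both; `is_deepseek` and `merge_run_kwargs` are hypothetical helper names used for illustration only.

```python
from typing import Any, Dict


def is_deepseek(model_name: str) -> bool:
    """Case-insensitive check, as in the `elif` condition above."""
    return "deepseek" in model_name.lower()


def merge_run_kwargs(run_kwargs: Dict[str, Any], temperature: float, max_tokens: int) -> Dict[str, Any]:
    """Add sampling parameters to `run_kwargs` in place and return it."""
    # `|=` updates the existing dict; keys on the right win on conflict.
    run_kwargs |= {"temperature": temperature, "max_tokens": max_tokens}
    return run_kwargs


if __name__ == "__main__":
    print(is_deepseek("accounts/fireworks/models/deepseek-v3"))
    print(merge_run_kwargs({}, temperature=0.0, max_tokens=4096))
```

Because the match is a substring test, any model name containing `deepseek` is routed to this branch, including provider-prefixed names.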
1 change: 1 addition & 0 deletions freeact/model/__init__.py
@@ -1,5 +1,6 @@
from freeact.model.base import CodeActModel, CodeActModelResponse, CodeActModelTurn
from freeact.model.claude.model import Claude, ClaudeModelName
from freeact.model.deepseek.model import DeepSeek
from freeact.model.gemini.model.chat import Gemini, GeminiResponse
from freeact.model.gemini.model.live import GeminiLive, GeminiModelName
from freeact.model.generic.model import GenericModel
Empty file.
56 changes: 56 additions & 0 deletions freeact/model/deepseek/model.py
@@ -0,0 +1,56 @@
import os
from typing import Any, Dict

from freeact.model.generic.model import GenericModel
from freeact.model.qwen.prompt import (
    EXECUTION_ERROR_TEMPLATE,
    EXECUTION_OUTPUT_TEMPLATE,
    SYSTEM_TEMPLATE,
)


class DeepSeek(GenericModel):
    """A specialized implementation of `GenericModel` for DeepSeek's models.

    This class configures `GenericModel` specifically for use with DeepSeek V3 models
    and uses the same prompt templates as Qwen 2.5 Coder.
    It has been tested with *DeepSeek V3*. Smaller models
    in this series may require adjustments to the prompt templates.

    Args:
        model_name: The provider-specific name of the DeepSeek model to use.
        api_key: Optional API key for DeepSeek. If not provided, reads from DEEPSEEK_API_KEY environment variable.
        base_url: Optional base URL for the API. If not provided, reads from DEEPSEEK_BASE_URL environment variable.
        skill_sources: Optional string containing Python skill module information to include in system template.
        system_template: Prompt template for the system message that guides the model to generate code actions.
            Must define a `{python_modules}` placeholder for the skill sources.
        execution_output_template: Prompt template for formatting execution outputs.
            Must define an `{execution_feedback}` placeholder.
        execution_error_template: Prompt template for formatting execution errors.
            Must define an `{execution_feedback}` placeholder.
        run_kwargs: Defines the stopping conditions for the model.
        **kwargs: Additional keyword arguments passed to the `GenericModel` constructor.
    """

    def __init__(
        self,
        model_name: str,
        api_key: str | None = None,
        base_url: str | None = None,
        skill_sources: str | None = None,
        system_template: str = SYSTEM_TEMPLATE,
        execution_output_template: str = EXECUTION_OUTPUT_TEMPLATE,
        execution_error_template: str = EXECUTION_ERROR_TEMPLATE,
        run_kwargs: Dict[str, Any] | None = None,
        **kwargs,
    ):
        super().__init__(
            model_name=model_name,
            api_key=api_key or os.getenv("DEEPSEEK_API_KEY"),
            base_url=base_url or os.getenv("DEEPSEEK_BASE_URL"),
            system_message=system_template.format(python_modules=skill_sources or ""),
            execution_output_template=execution_output_template,
            execution_error_template=execution_error_template,
            run_kwargs=run_kwargs,
            **kwargs,
        )
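
The constructor above does two things before delegating to `GenericModel`: it falls back to environment variables for credentials, and it fills the `{python_modules}` placeholder of the system template. The following sketch reproduces both steps in isolation. `resolve_credentials`, `render_system_message`, and `EXAMPLE_SYSTEM_TEMPLATE` are hypothetical stand-ins; the real `SYSTEM_TEMPLATE` lives in `freeact.model.qwen.prompt` and is not shown in this commit.

```python
import os
from typing import Mapping, Optional, Tuple

# Stand-in template (assumption); only the placeholder name comes from the docstring.
EXAMPLE_SYSTEM_TEMPLATE = "You can use these Python modules:\n{python_modules}"


def resolve_credentials(
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Prefer explicit arguments, else fall back to DEEPSEEK_* variables."""
    env = os.environ if env is None else env
    # `or` treats an empty string like None, so "" also triggers the fallback.
    return (
        api_key or env.get("DEEPSEEK_API_KEY"),
        base_url or env.get("DEEPSEEK_BASE_URL"),
    )


def render_system_message(template: str, skill_sources: Optional[str] = None) -> str:
    """Fill the `{python_modules}` placeholder; missing sources become ''."""
    return template.format(python_modules=skill_sources or "")


if __name__ == "__main__":
    fake_env = {"DEEPSEEK_API_KEY": "env-key", "DEEPSEEK_BASE_URL": "https://example.invalid/v1"}
    print(resolve_credentials(env=fake_env))
    print(render_system_message(EXAMPLE_SYSTEM_TEMPLATE, "my_skills.search"))
```

Note that `evaluation/evaluate.py` passes `FIREWORKS_API_KEY` and the Fireworks base URL explicitly, so the `DEEPSEEK_*` fallback applies only when those arguments are omitted.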
