Commit 9257924
paul-gauthier committed Nov 24, 2024
1 parent 8d0ba40 commit 9257924
Showing 2 changed files with 34 additions and 27 deletions.
37 changes: 18 additions & 19 deletions aider/website/_data/quant.yml
@@ -24,7 +24,7 @@
 
 - dirname: 2024-11-22-18-56-13--ollama-qwen2.5-coder:32b-instruct-fp16
   test_cases: 132
-  model: ollama/qwen2.5-coder:32b-instruct-fp16 (64k context)
+  model: Ollama fp16
   edit_format: diff
   commit_hash: f06452c-dirty, 6a0a97c-dirty, 4e9ae16-dirty, 5506d0f-dirty
   pass_rate_1: 58.3
@@ -70,7 +70,7 @@
 
 - dirname: 2024-11-22-17-53-35--qwen25-coder-32b-Instruct-4bit
   test_cases: 133
-  model: mlx-community/Qwen2.5-Coder-32B-Instruct-4bit
+  model: mlx-community 4bit
   edit_format: diff
   commit_hash: a16dcab-dirty
   pass_rate_1: 60.2
@@ -93,7 +93,7 @@
 
 - dirname: 2024-11-23-15-07-20--qwen25-coder-32b-Instruct-8bit
   test_cases: 133
-  model: mlx-community/Qwen2.5-Coder-32B-Instruct-8bit
+  model: mlx-community 8bit
   edit_format: diff
   commit_hash: a16dcab-dirty
   pass_rate_1: 59.4
@@ -137,26 +137,25 @@
   seconds_per_case: 40.7
   total_cost: 0.1497
 
-- dirname: 2024-11-21-23-33-47--ollama-qwen25-coder
+- dirname: 2024-11-23-21-08-53--ollama-qwen2.5-coder:32b-instruct-q4_K_M-8kctx
   test_cases: 133
-  model: Ollama Q4_K_M
+  model: Ollama q4_K_M
   edit_format: diff
-  commit_hash: 488c88d-dirty
-  pass_rate_1: 44.4
-  pass_rate_2: 53.4
-  percent_cases_well_formed: 44.4
-  error_outputs: 231
-  num_malformed_responses: 183
-  num_with_malformed_responses: 74
-  user_asks: 79
+  commit_hash: baa1335-dirty, e63df83-dirty, ff8c1aa-dirty
+  pass_rate_1: 54.9
+  pass_rate_2: 66.9
+  percent_cases_well_formed: 94.0
+  error_outputs: 21
+  num_malformed_responses: 21
+  num_with_malformed_responses: 8
+  user_asks: 5
   lazy_comments: 0
-  syntax_errors: 2
+  syntax_errors: 0
   indentation_errors: 0
   exhausted_context_windows: 0
-  test_timeouts: 2
+  test_timeouts: 3
   command: aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M
-  date: 2024-11-21
+  date: 2024-11-23
   versions: 0.64.2.dev
-  seconds_per_case: 86.7
-  total_cost: 0.0000
-
+  seconds_per_case: 35.7
+  total_cost: 0.0000
24 changes: 16 additions & 8 deletions aider/website/_posts/2024-11-21-quantization.md
@@ -18,7 +18,7 @@ can strongly impact code editing skill.
 Heavily quantized models are often used by cloud API providers
 and local model servers like Ollama or MLX.
 
-<canvas id="quantChart" width="800" height="600" style="margin: 20px 0"></canvas>
+<canvas id="quantChart" width="800" height="500" style="margin: 20px 0"></canvas>
 <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
 <script>
 {% include quant-chart.js %}
@@ -29,16 +29,16 @@ served both locally and from cloud providers.
 
 - The [HuggingFace BF16 weights](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) served via [glhf.chat](https://glhf.chat).
 - Hyperbolic labs API for [qwen2-5-coder-32b-instruct](https://app.hyperbolic.xyz/models/qwen2-5-coder-32b-instruct), which is using BF16. This result is probably within the expected variance of the HF result.
-- A [4bit quant for mlx](https://t.co/cwX3DYX35D).
+- [4bit and 8bit quants for mlx](https://t.co/cwX3DYX35D).
 - The results from [OpenRouter's mix of providers](https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct/providers) which serve the model with different levels of quantization.
-- Ollama locally serving [qwen2.5-coder:32b-instruct-q4_K_M)](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M), which has `Q4_K_M` quantization, with Ollama's default 2k context window.
+- Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M).
 
 The best version of the model rivals GPT-4o, while the worst performer
-is more like GPT-3.5 Turbo level.
+is more like GPT-4 level.
 
 {: .note }
 This article is being updated as additional benchmark runs complete.
-The original version included incorrect Ollama models.
+
 
 <input type="text" id="quantSearchInput" placeholder="Search..." style="width: 100%; max-width: 800px; margin: 10px auto; padding: 8px; display: block; border: 1px solid #ddd; border-radius: 4px;">

@@ -100,27 +100,30 @@ document.getElementById('quantSearchInput').addEventListener('keyup', function()
 });
 </script>
 
-## Setting the context window size
+## Setting Ollama's context window size
 
 [Ollama uses a 2k context window by default](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size),
 which is very small for working with aider.
 
+All of the Ollama results above were collected with at least an 8k context window, which
+is large enough to attempt all the coding problems in the benchmark.
+
 You can set the Ollama server's context window with a
 [`.aider.model.settings.yml` file](https://aider.chat/docs/config/adv-model-settings.html#model-settings)
 like this:
 
 ```
 - name: aider/extra_params
   extra_params:
-    num_ctx: 65536
+    num_ctx: 8192
 ```
 
 That uses the special model name `aider/extra_params` to set it for *all* models. You should probably use a specific model name like:
 
 ```
 - name: ollama/qwen2.5-coder:32b-instruct-fp16
   extra_params:
-    num_ctx: 65536
+    num_ctx: 8192
 ```
 
 ## Choosing providers with OpenRouter
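
A brief aside on the `num_ctx` settings in the hunk above: the entry below is a minimal sketch of how such a settings file is typically put to use, assuming aider reads `.aider.model.settings.yml` from the working directory (per the model-settings docs linked in the diff) and reusing the q4_K_M model tag recorded in the benchmark's `command` field.

```
# .aider.model.settings.yml  (illustrative sketch, not part of this commit)
# Raise Ollama's context window (2k by default) for the model under test.
- name: ollama/qwen2.5-coder:32b-instruct-q4_K_M
  extra_params:
    num_ctx: 8192  # an 8k window is enough to attempt every benchmark problem
```

With that file in place, the benchmark's recorded invocation `aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M` should pass `num_ctx` through to the local Ollama server.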
@@ -130,3 +133,8 @@ OpenRouter allows you to ignore specific providers in your
 This can be effective to exclude highly quantized or otherwise
 undesirable providers.
 
+
+{: .note }
+Earlier versions of this article included incorrect Ollama models,
+and also included some Ollama results with the too small default 2k
+context window.
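
As a hedged sketch of how to act on that advice: OpenRouter's provider-routing preferences can also be sent per request, and aider's `extra_params` mechanism shown earlier can carry them. The `ignore` field and the provider name below are assumptions drawn from OpenRouter's provider-routing options, not from this commit; check the providers page linked in the article for the exact names to exclude.

```
# .aider.model.settings.yml  (illustrative sketch, assuming OpenRouter's
# provider-routing request options and aider's extra_params passthrough)
- name: openrouter/qwen/qwen-2.5-coder-32b-instruct
  extra_params:
    extra_body:
      provider:
        # Hypothetical provider name; pick from the providers page.
        ignore: ["SomeHeavilyQuantizedProvider"]
```

Setting the equivalent preference in your OpenRouter account settings accomplishes the same thing without any local configuration.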
