Add 100 Samples Per Regex / JSON Schema #35

lapp0 · 2024-10-11T16:49:24Z

Fixes #19

Changes

Adds 100 samples for each schema / pattern to src/samples/
data.py: Remove example key and replace with samples key
Update all src/benchmark_*.py ASV benchmark scripts to run 100 samples per benchmark

Caveat: We need to use RegexGuide.from_regex once dottxt-ai/outlines#1204 is merged and outlines version is bumped.

Sample Generation Scripts

`phone_number.json`

import random
import json

def generate_phone_number():
    # Generate 3 random digits, 3 random digits, and 4 random digits for the phone number
    area_code = f'{random.randint(100, 999)}'
    prefix = f'{random.randint(100, 999)}'
    line_number = f'{random.randint(1000, 9999)}'

    # Combine the parts into the format XXX-XXX-XXXX
    return f'{area_code}-{prefix}-{line_number}'

# Create a list of 100 phone numbers
phone_numbers = [generate_phone_number() for _ in range(100)]

print(json.dumps(phone_numbers))

`url.json`

import pandas as pd
import json

url = 'https://raw.githubusercontent.com/steciuk/SNA-reddit-bipartite-analysis/2fc2b2920ab1ff173ae457b4b1fcd490eb1aee16/data/posts_technews.csv'
df = pd.read_csv(url)

url_column_list = df['url'].tolist()

print(json.dumps(url_column_list[:100]))

`gsm8k.json`

from datasets import load_dataset
import json

dataset = load_dataset("thesven/gsm8k-reasoning", split="train")
dataset = dataset.map(lambda row: {"answer": row["answer"].split("<<")[0].split("=")[0].strip()})

gsm8k_thinking = dataset.select(range(100))["answer"]

print(json.dumps([gt + ". The answer is 42." for gt in gsm8k_thinking]))

`complex_str.json`

import random
import json


def random_string_from_pattern():
    # Define the patterns to choose from
    patterns = [
        r'(0|[1-9][0-9]*)',  # Integer pattern
        r'true',             # True boolean
        r'false',            # False boolean
        r'([a-zA-Z_][a-zA-Z_0-9]*)'  # Identifier pattern (letters, digits, underscore)
    ]

    # Randomly select one pattern
    selected_pattern = random.choice(patterns)

    # If it's the integer pattern, generate a random integer
    if selected_pattern == r'(0|[1-9][0-9]*)':
        return str(random.choice([0] + [random.randint(1, 100)]))

    # If it's the identifier pattern, generate a random identifier
    elif selected_pattern == r'([a-zA-Z_][a-zA-Z_0-9]*)':
        identifier_length = random.randint(1, 10)
        identifier = ''.join(random.choices('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_', k=1))  # First character
        identifier += ''.join(random.choices('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789', k=identifier_length - 1))
        return identifier

    # If it's true or false, just return the string 'true' or 'false'
    else:
        return selected_pattern


def generate_random_string(n):
    return ''.join(random_string_from_pattern() for _ in range(n))


data = [generate_random_string(random.randint(1, 10)) for _ in range(100)]
print(json.dumps(data))

`long_integer.json`

import random
import json


def random_long_number():
    first_digit = random.choice(range(1, 10))

    remaining_digits_length = random.randint(1, 14)
    remaining_digits = ''.join(random.choices('0123456789', k=remaining_digits_length))

    return f"+{first_digit}{remaining_digits}"


data = [random_long_number() for _ in range(100)]
print(json.dumps(data))

`recording_schema.json` and `rpg_characters.json`

import outlines
import json


JSON_SCHEMA = None  # TODO: Put schema here


qwen_model = outlines.models.transformers("Qwen/Qwen2.5-14B-Instruct", model_kwargs=dict(load_in_8bit=True))


def create_input(prompt):
    return qwen_model.tokenizer.tokenizer.apply_chat_template(
        [
            {"role": "system", "content": "You are a helpful AI assistant. You only speak English."},
            {"role": "user", "content": prompt}
        ],
        tokenize=False,
        add_generation_prompt=True,
    )


generator = outlines.generate.json(qwen_model, json.dumps(JSON_SCHEMA))


results = []
for batch_idx in range(25):
    # 25 batches of 4 identical prompts yield 100 samples in total
    inputs = [
        create_input(f"For the schema\n\n{JSON_SCHEMA}\n\nThis is a valid json:\n")
        for _ in range(4)
    ]
    # Retry the batch until generation succeeds
    while True:
        try:
            results += generator(inputs, max_tokens=1000)
            break
        except Exception as e:
            print(batch_idx, e)

print(json.dumps(results))

TODO

Figure out why outlines-core is faster than outlines on regex and conversely on JSON.
Bump to outlines-core's latest release
Separate "compilation" (TTFT implies "at every run") from the number of tokens per second after compilation.

rlouf · 2024-10-15T12:35:59Z

src/benchmark_lfe.py

-        for i in range(len(regex_example_tokens)):
-            _ = token_enforcer.get_allowed_tokens(regex_example_tokens[: i + 1])
+        for regex_sample in regex_samples:
+            regex_sample_tokens = self.tokenizer.encode(regex_sample)


Let's get this out of the timing method by pre-tokenizing the samples so we don't time this.
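
For illustration, a minimal sketch of what pre-tokenizing could look like in an ASV-style class, where `setup()` runs outside the timed region. `WhitespaceTokenizer`, `RegexBenchmarkSketch` and the sample strings are hypothetical stand-ins, not the code in `src/benchmark_lfe.py`.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in for a real tokenizer; splits on whitespace."""

    def encode(self, text):
        return text.split()


class RegexBenchmarkSketch:
    """ASV-style benchmark: setup() is not timed, time_* methods are."""

    def setup(self):
        self.tokenizer = WhitespaceTokenizer()
        regex_samples = ["123-456-7890", "555-010-9999"]  # made-up samples
        # Tokenize once here so encode() never runs inside the timed method.
        self.regex_samples_tokens = [
            self.tokenizer.encode(sample) for sample in regex_samples
        ]

    def time_walk_samples(self):
        # Only the per-prefix work is measured. The slice is a placeholder for
        # a call such as token_enforcer.get_allowed_tokens(prefix).
        for sample_tokens in self.regex_samples_tokens:
            for i in range(len(sample_tokens)):
                _ = sample_tokens[: i + 1]
```

With this layout the tokenizer cost is paid once per setup and does not show up in the reported timings.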

rlouf · 2024-10-15T14:06:35Z

Given that the timings for OutlinesJSONSchema are in the tens of milliseconds, my suspicion is that the port of build_regex_from_schema to Rust in outlines-core is inefficient for some reason. Could you profile the run for JSON Schema and outlines-core to confirm this? Actually, the first thing to try is to compare the regexes generated by outlines with those currently generated by outlines-core.

Note that timings for this function on outlines-core are in the tens of microseconds. This is a mystery to me.
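
One way to do that comparison, sketched under assumptions: a small helper that reports where two regex strings first diverge. `first_divergence` is a hypothetical name, and the two regexes below are made up for illustration; they are not output from outlines or outlines-core.

```python
def first_divergence(left, right, context=10):
    """Return (index, left_snippet, right_snippet) at the first difference, or None."""
    for i, (a, b) in enumerate(zip(left, right)):
        if a != b:
            start = max(0, i - context)
            return i, left[start : i + context], right[start : i + context]
    if len(left) != len(right):
        # One string is a strict prefix of the other.
        i = min(len(left), len(right))
        start = max(0, i - context)
        return i, left[start : i + context], right[start : i + context]
    return None


# Made-up regexes standing in for the two libraries' output.
regex_a = r'\{"name":"[a-z]+"\}'
regex_b = r'\{"name":"[a-z]*"\}'
result = first_divergence(regex_a, regex_b)
print(result)
```

If the helper returns `None` for every schema, the generated regexes are identical and the slowdown has to come from somewhere else.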

src/benchmark_lfe.py

rlouf · 2024-10-15T16:40:43Z

Ran the benchmarks locally with outlines-core==0.1.14 and the difference between Outlines and Outlines core is still mysterious (outlines core faster on regex but slower on json):

[58.33%] ··· benchmark_lfe.LMFormatEnforcerJsonSchema.time_lfe                                                                                                                                                                                                                         ok
[58.33%] ··· ===================================== =============== ======================
             --                                               json_schema_name           
             ------------------------------------- --------------------------------------
                             model                  RPG character   Simple nested schema 
             ===================================== =============== ======================
              NousResearch/Nous-Hermes-llama-2-7b     40.0±0.4μs          199±5μs        
                              gpt2                    40.8±0.9μs          216±3μs        
               NousResearch/Hermes-3-Llama-3.1-8B      192±5μs            289±1μs        
                 unsloth/gemma-2-2b-it-bnb-4bit        210±9μs            259±10μs       
             ===================================== =============== ======================

[66.67%] ··· benchmark_lfe.LMFormatEnforcerRegex.time_lfe                                                                                                                                                                                                                              ok
[66.67%] ··· ===================================== ============== ============ =========== ================ ==============
             --                                                                   regex_name                              
             ------------------------------------- -----------------------------------------------------------------------
                             model                  Phone Number      URL         GSM8K     Complex string   Long integer 
             ===================================== ============== ============ =========== ================ ==============
              NousResearch/Nous-Hermes-llama-2-7b    41.2±0.2ms     537±2ms     130±0.4ms     80.5±0.1ms     28.2±0.05ms  
                              gpt2                   12.9±0.1ms     401±5ms     204±0.9ms      486±10ms       7.74±0.2ms  
               NousResearch/Hermes-3-Llama-3.1-8B    18.9±0.2ms    4.72±0.08s    252±1ms      1.13±0.03s      27.9±0.3ms  
                 unsloth/gemma-2-2b-it-bnb-4bit       47.3±2ms     11.8±0.05s    289±7ms      2.14±0.06s      40.9±0.1ms  
             ===================================== ============== ============ =========== ================ ==============

[75.00%] ··· benchmark_outlines.OutlinesJsonSchema.time_outlines                                                                                                                                                                                                                       ok
[75.00%] ··· ===================================== =============== ======================
             --                                               json_schema_name           
             ------------------------------------- --------------------------------------
                             model                  RPG character   Simple nested schema 
             ===================================== =============== ======================
              NousResearch/Nous-Hermes-llama-2-7b     13.8±0.1ms         12.9±0.2ms      
                              gpt2                    15.9±0.1ms         17.1±0.7ms      
               NousResearch/Hermes-3-Llama-3.1-8B      81.3±1ms          75.6±0.9ms      
                 unsloth/gemma-2-2b-it-bnb-4bit        197±10ms           197±6ms        
             ===================================== =============== ======================

[83.33%] ··· benchmark_outlines.OutlinesRegex.time_outlines                                                                                                                                                                                                                            ok
[83.33%] ··· ===================================== ============== ============ ============ ================ ==============
             --                                                                   regex_name                               
             ------------------------------------- ------------------------------------------------------------------------
                             model                  Phone Number      URL         GSM8K      Complex string   Long integer 
             ===================================== ============== ============ ============ ================ ==============
              NousResearch/Nous-Hermes-llama-2-7b    81.4±0.6ms     170±2ms     8.28±0.04s      85.3±1ms       90.1±0.8ms  
                              gpt2                   112±0.8ms      235±7ms     15.3±0.05s      114±1ms         129±1ms    
               NousResearch/Hermes-3-Llama-3.1-8B     381±3ms       651±6ms     30.7±0.3s       380±4ms         426±4ms    
                 unsloth/gemma-2-2b-it-bnb-4bit       859±10ms     1.36±0.01s   1.05±0.01m      840±2ms         918±4ms    
             ===================================== ============== ============ ============ ================ ==============

[91.67%] ··· benchmark_outlines_core.OutlinesCoreJsonSchema.time_outlines_core                                                                                                                                                                                                         ok
[91.67%] ··· ===================================== =============== ======================
             --                                               json_schema_name           
             ------------------------------------- --------------------------------------
                             model                  RPG character   Simple nested schema 
             ===================================== =============== ======================
              NousResearch/Nous-Hermes-llama-2-7b     285±0.7ms           602±1ms        
                              gpt2                     403±2ms            846±2ms        
               NousResearch/Hermes-3-Llama-3.1-8B      958±20ms          1.77±0.03s      
                 unsloth/gemma-2-2b-it-bnb-4bit       1.99±0.01s         3.44±0.01s      
             ===================================== =============== ======================

[100.00%] ··· benchmark_outlines_core.OutlinesCoreRegex.time_outlines_core                                                                                                                                                                                                              ok[100.00%] ··· ===================================== ============== =========== ============ ================ ==============
              --                                                                   regex_name                              
              ------------------------------------- -----------------------------------------------------------------------
                              model                  Phone Number      URL        GSM8K      Complex string   Long integer 
              ===================================== ============== =========== ============ ================ ==============
               NousResearch/Nous-Hermes-llama-2-7b    81.5±0.2ms    143±0.3ms    5.49±0s       85.8±0.3ms      82.9±0.3ms  
                               gpt2                   100±0.3ms      189±3ms    10.7±0.01s      107±2ms         103±2ms    
                NousResearch/Hermes-3-Llama-3.1-8B     274±6ms       433±9ms    20.5±0.2s       286±7ms         284±7ms    
                  unsloth/gemma-2-2b-it-bnb-4bit       613±3ms       935±5ms    41.7±0.1s       647±6ms         634±5ms    
              ===================================== ============== =========== ============ ================ ==============

rlouf · 2024-10-16T06:07:40Z

pyproject.toml

-    "outlines==0.0.46",
-    "outlines-core==0.1.0",
+    "lm-format-enforcer==0.10.7",
+    "outlines==0.1.1",


The idea is to compare to the Numba version, can you use an earlier version?

Since we're no longer maintaining the Numba implementation of regex.py, wouldn't it make sense to reference the last benchmark run prior to replacement rather than continuously tracking it?

Outlines benchmarks: https://github.com/dottxt-ai/outlines/actions/runs/11079055001/job/30787437777

I could also perform a single run of this suite with the Numba implementation without merging it if that makes sense.

Not for now, we need the numbers for the outlines-core release. We’ll tag main once we’re happy with the setup, refer people to this tag for comparisons with Outlines and eventually remove it. Does that make sense?

Sounds good.

outlines-core doesn't have caching. I assume you'd like me to use Outlines caching with outlines-core? (for now we can just copy https://github.com/dottxt-ai/outlines/blob/main/outlines/fsm/guide.py#L76-L99)

Also let's use the latest version of outlines-core.
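
To make the caching idea in this thread concrete, here is a rough sketch that uses `functools.lru_cache` from the standard library as a swapped-in technique. It is not the code at the linked `guide.py` lines, and `build_index` is a hypothetical placeholder for whatever expensive construction gets cached.

```python
from functools import lru_cache

build_calls = 0  # counts how often the expensive path actually runs


@lru_cache(maxsize=None)
def build_index(regex_string, vocabulary_key):
    """Hypothetical stand-in for an expensive index/guide construction."""
    global build_calls
    build_calls += 1
    return (regex_string, vocabulary_key)


first = build_index(r"\d{3}-\d{3}-\d{4}", "gpt2")
second = build_index(r"\d{3}-\d{3}-\d{4}", "gpt2")  # served from the cache
build_index.cache_clear()                            # e.g. between benchmark runs
third = build_index(r"\d{3}-\d{3}-\d{4}", "gpt2")   # rebuilt after clearing
```

Clearing the cache between measured runs matters for benchmarks: otherwise every run after the first only measures a cache lookup.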

lapp0 · 2024-10-16T10:01:03Z

Updated to latest version of all three benchmarked packages.
Fix absurdly low runtimes

Outlines: teardown() step to clear cache
lm-format-enforcer: teardown() step to delete TokenEnforcer and its contained cache
JsonSchema: Ensure "samples" is a list, not a generator which is exhausted prior to start of measured run.

By default ASV runs warmup steps prior to the measured run, resulting in the unexpected caching and generator exhaustion described above.

Added "Upload Benchmark Results Folder" step to asv_benchmarks_pr.yaml (@rlouf should this be in asv_benchmark_main.yml as well?)
Creating a separate PR to split up Time to First Token, and Tokens Per Second
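
The generator exhaustion described above can be reproduced in isolation. This is a toy sketch, with `make_samples` and `run_once` as hypothetical stand-ins for the sample loader and the timed benchmark body:

```python
def make_samples():
    # A generator expression can only be consumed once.
    return (s.upper() for s in ["a", "b", "c"])


def run_once(samples):
    # Stand-in for a timed benchmark body: count the samples it processed.
    return sum(1 for _ in samples)


generator_samples = make_samples()
warmup_count = run_once(generator_samples)    # warmup consumes everything
measured_count = run_once(generator_samples)  # measured run sees nothing

list_samples = list(make_samples())
list_warmup_count = run_once(list_samples)
list_measured_count = run_once(list_samples)  # a list can be re-iterated

print(warmup_count, measured_count, list_measured_count)
```

The measured run over the generator processes zero samples, which is consistent with the implausibly low timings seen before the fix.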

Given that the timings for OutlinesJSONSchema are in the tens of milliseconds, my suspicion is that the port of build_regex_from_schema to Rust in outlines-core is inefficient for some reason. Could you profile the run for JSON Schema and outlines-core to confirm this? Actually, the first thing to try is to compare the regexes generated by outlines with those currently generated by outlines-core.

Seeing more sane benchmarks locally for a small subset. Will analyze the results of latest benchmark run first to ensure this is necessary.

rlouf · 2024-10-16T13:11:27Z

A few comments:

~~Can you downgrade outlines to a version that used Numba, and use the latest version of outlines-core?~~ I pushed to your branch
~~On PRs we should use the --quick flag of asv (asv run --quick) but keep it as is when merging on main~~ I pushed to your branch
~~We need to increase the timeout for lm-format-enforcer~~ I pushed to your branch
Benchmarks are currently failing for outlines and outlines-core
Timings look much more reasonable

brandonwillard

The benchmark method names, i.e. time_{package}, seem a little redundant. The package is already given by the class name, and exactly what's being timed isn't apparent. Can we change one of those so that it clarifies exactly what is being measured?

lapp0 · 2024-10-17T09:21:20Z

Pushed a5adbe4 to fix benchmarks (sample run)

Note: time_lfe_total / time_lfe_runtime still fails due to timeout for "Simple nested schema" with unsloth/gemma-2-2b-it-bnb-4bit

Changes

Introduces new benchmarks:
- time_{package}_first_token (time to first token)
- time_{package}_runtime (time to generate all samples after first token)
- time_{package}_total (renamed from time_{package}; sum of first_token and runtime)
Refactored the code to be cleaner and more concise, applying DRY.
Ensure samples are tokenized in setup()
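
A minimal sketch of how the three timings listed above could be laid out in one ASV-style class. `StubGuide` and `SplitTimingSketch` are hypothetical and do not reflect the real outlines, outlines-core, or lm-format-enforcer APIs.

```python
class StubGuide:
    """Hypothetical guide: compilation happens in __init__, stepping in advance()."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.steps = 0

    def advance(self, token):
        self.steps += 1


class SplitTimingSketch:
    def setup(self):
        # Samples are pre-tokenized here, outside any timed method.
        self.samples_tokens = [["1", "2", "3"], ["4", "5"]]
        # Pre-built guide so time_runtime excludes compilation.
        self.guide = StubGuide("[0-9]+")

    def _first_token(self):
        guide = StubGuide("[0-9]+")  # compilation cost lands here
        guide.advance(self.samples_tokens[0][0])
        return guide

    def _runtime(self, guide):
        for tokens in self.samples_tokens:
            for token in tokens:
                guide.advance(token)

    def time_first_token(self):
        self._first_token()

    def time_runtime(self):
        self._runtime(self.guide)

    def time_total(self):
        self._runtime(self._first_token())
```

Keeping the shared work in private helpers means each `time_*` method measures exactly one thing, and `time_total` stays the composition of the other two.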

Benchmarks

For NousResearch/Nous-Hermes-llama-2-7b, (Long integer, Simple nested schema)

| Parameter | Method | Benchmark Outlines Core | Benchmark Outlines | Benchmark LFE |
| --- | --- | --- | --- | --- |
| Simple nested schema | first_token | 1.22s | 2.82s | 457μs |
| Simple nested schema | runtime | 47.3ms | 18.5ms | 8.40s |
| Simple nested schema | total | 1.28s | 2.77s | 8.58s |
| Long integer | first_token | 178ms | 1.11s | 930μs |
| Long integer | runtime | 5.06ms | 1.96ms | 38.1ms |
| Long integer | total | 180ms | 1.11s | 39.6ms |

Edit

Pushed c86e55d which fixes a bug resulting in total and first_token benchmarks running twice.

lapp0 · 2024-10-21T05:02:53Z

Just a heads-up: The main branch is currently pinned to outlines-core==0.1.0, which uses a different RegexGuide interface. This causes the PR benchmark tests to fail. After merging, however, the benchmarks do run, with the caveat that they often time out.

You can see the benchmark workflow run for asv_benchmark_main.yml here: https://github.com/lapp0/benchmarks/actions/runs/11433488618/job/31810480964

rlouf · 2024-10-21T10:31:06Z

Everything works as intended, so I will merge this PR. I will do a follow-up PR to separate the outlines and outlines-core benchmarking code: not only does it seem to introduce extra benchmarking steps, we will soon remove outlines from these benchmarks.

lapp0 force-pushed the add-100-samples branch 3 times, most recently from a199b7c to 698809b Compare October 11, 2024 17:10

rlouf marked this pull request as ready for review October 11, 2024 17:19

lapp0 force-pushed the add-100-samples branch 8 times, most recently from 91c66eb to 80b549e Compare October 13, 2024 23:44

rlouf reviewed Oct 15, 2024

rlouf reviewed Oct 16, 2024

lapp0 force-pushed the add-100-samples branch 3 times, most recently from 08f10af to 0d73c1c Compare October 16, 2024 09:21

brandonwillard reviewed Oct 16, 2024

lapp0 force-pushed the add-100-samples branch 4 times, most recently from cc7400b to c86e55d Compare October 21, 2024 04:58

rlouf mentioned this pull request Oct 21, 2024

Remove Outlines from the benchmarks #36

Closed

rlouf force-pushed the add-100-samples branch from f208928 to 1d53dbc Compare October 21, 2024 10:29

lapp0 and others added 3 commits October 21, 2024 12:29

upload benchmark results folder artifact

6b3ea1b

Do a quick benchmark run in CI

04bdfd1

Measure runtime and total time

1d53dbc

rlouf merged commit 0e02ffb into dottxt-ai:main Oct 21, 2024
1 of 2 checks passed
