Skip to content

Commit

Permalink
Updating README, bumping version
Browse files Browse the repository at this point in the history
  • Loading branch information
noamgat committed Oct 17, 2023
1 parent b843a1e commit 05a81c5
Show file tree
Hide file tree
Showing 2 changed files with 22 additions and 9 deletions.
29 changes: 21 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,34 +19,44 @@ This project solves the issues by filtering the tokens that the language model i
## Installation
```pip install lm-format-enforcer```

## Simple example
## Basic Tutorial
```python
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser, generate_enforced
from lmformatenforcer import JsonSchemaParser, build_transformers_prefix_allowed_tokens_fn
from transformers import pipeline

class AnswerFormat(BaseModel):
first_name: str
last_name: str
year_of_birth: int
num_seasons_in_nba: int

question = f'Please give me information about Michael Jordan. You MUST answer using the following json schema: {AnswerFormat.schema_json()}'
# Create a transformers pipeline
hf_pipeline = pipeline('text-generation', model='meta-llama/Llama-2-7b-hf')
prompt = f'Here is information about Michael Jordan in the following json schema: {AnswerFormat.schema_json()} :\n'

# Create a character level parser and build a transformers prefix function from it
parser = JsonSchemaParser(AnswerFormat.schema())
prefix_function = build_transformers_prefix_allowed_tokens_fn(hf_pipeline.tokenizer, parser)

# Call the pipeline with the prefix function
output_dict = hf_pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)

# Call generate_enforced(model, tokenizer, parser, ...) instead of model.generate(...):
inputs = tokenizer([question], return_tensors='pt', add_special_tokens=False, return_token_type_ids=False).to(device)
result = generate_enforced(model, tokenizer, parser, inputs=inputs)
# Extract the results
result = output_dict[0]['generated_text'][len(prompt):]
print(result)
# {'first_name': 'Michael', 'last_name': 'Jordan', 'year_of_birth': 1963, 'num_seasons_in_nba': 15}
```

## Capabilities / Advantages

- Works with any Python language model and tokenizer. Already supports transformers and LangChain. Can be adapted to others.
- Supports batched generation and beam searches - each input / beam can have different tokens filtered at every timestep
- Supports both JSON Schema (strong) and Regular Expression (partial) formats
- Supports both JSON Schema and Regular Expression formats
- Supports both required and optional fields in JSON schemas
- Supports nested fields, arrays and dictionaries in JSON schemas
- Gives the language model freedom to control whitespacing and field ordering in JSON schemas, reducing hallucinations
- Gives the language model freedom to control whitespacing and field ordering in JSON schemas, reducing hallucinations.
- Does not modify the high level loop of transformers API, so can be used in any scenario.

## Detailed example

Expand All @@ -57,6 +67,9 @@ We created a Google Colab Notebook which contains a full example of how to use t
</a>

You can also [view the notebook in GitHub](https://github.com/noamgat/lm-format-enforcer/blob/main/samples/colab_llama2_enforcer.ipynb).

For the different ways to integrate with huggingface transformers, see the [unit tests](https://github.com/noamgat/lm-format-enforcer/blob/main/tests/test_transformerenforcer.py).

## How does it work?

The library works by combining a character level parser and a tokenizer prefix tree into a smart token filtering mechanism.
Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[tool.poetry]
name = "lm-format-enforcer"
version = "0.3.1"
version = "0.3.2"
description = "Enforce the output format (JSON Schema, Regex etc) of a language model"
authors = ["Noam Gat <[email protected]>"]
license = "MIT"
Expand Down

0 comments on commit 05a81c5

Please sign in to comment.