Commit

Implementing the Distillation Decorator prototype (#118)
jxnl authored Oct 15, 2023
1 parent 41c4636 commit 9fe3c83
Showing 9 changed files with 610 additions and 1 deletion.
110 changes: 110 additions & 0 deletions docs/blog/posts/distilation-part1.md
@@ -0,0 +1,110 @@
---
draft: False
date: 2023-10-17
tags:
- python
- distillation
- function calling
- finetuning
---

# Introduction to `Instructions` from `Instructor`: finetuning from Python functions

The core philosophy of the `instructor` library is to make language models backwards compatible with existing code. By adding Pydantic to the mix, we can work with LLMs easily and with far less worry.

However, a single function often isn't just one LLM call. After the results are returned, there's [validation](/docs/validation.md), plus additional processing and formatting, before you `return` the result.

But the promise of LLMs is that they can do all of this in one go. So how do we get there? End-to-end finetuning is a great tool for enhancing language models. Instructor uses type hints via Pydantic to maintain backward compatibility, while distillation focuses on fine-tuning language models to imitate specific functions.

## Challenges in Fine-tuning

Fine-tuning a model isn't as straightforward as writing `def f(a, b): return a * b` to teach a model three-digit multiplication. Substantial data preparation is required, which makes logging for data collection cumbersome. Luckily, OpenAI not only provides a fine-tuning API but also function calling, which simplifies the process with structured outputs! Moreover, the finetuned model lets us avoid passing the schema on every request, resulting in fewer tokens being used!

## Role of Instructor in Easing the Process

The feature `from instructor import Instructions` simplifies this. It decorates Python functions that return Pydantic objects, automatically creating a fine-tuning dataset when provided a handler for logging. This allows you to finetune a model to imitate a function's behavior.

## How to Use Instructor's Distillation Feature

Here's an example to illustrate its use:

```python
import logging
import random
from pydantic import BaseModel
from instructor import Instructions

logging.basicConfig(level=logging.INFO)

instructions = Instructions(
    name="three_digit_multiply",
    finetune_format="messages",
    log_handlers=[logging.FileHandler("math_finetunes.jsonl")]
)

class Multiply(BaseModel):
    a: int
    b: int
    result: int

@instructions.distil
def fn(a: int, b: int) -> Multiply:
    resp = a * b
    return Multiply(a=a, b=b, result=resp)

for _ in range(10):
    a = random.randint(100, 999)
    b = random.randint(100, 999)
    print(fn(a, b))
```

## Logging output

```python
{
    "messages": [
        {"role": "system", "content": 'Predict the results of this function: ...'},
        {"role": "user", "content": 'Return fn(133, b=539)'},
        {
            "role": "assistant",
            "function_call": {
                "name": "Multiply",
                "arguments": '{"a":133,"b":539,"result":89509}'
            }
        }
    ],
    "functions": [
        {"name": "Multiply", "description": "Correctly extracted `Multiply`..."}
    ]
}
```

## Why Instructor and Distillation are Useful

Many systems are not as simple as a single `openai.ChatCompletion.create` call; instead, we often create objects, do additional processing, validation, and error correction before returning the result. This is a lot of work, and it's easy to make mistakes. Instructor's `distil` feature makes this process easier by:

1. Streamlining complex functions with validations, making them more efficient.
2. Facilitating the integration of classical machine learning with language models.

By understanding and leveraging these capabilities, you can create powerful, fine-tuned language models with ease. To learn more about how to use the resulting file to finetune a model, check out the [cli](/docs/cli/finetune.md).
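
If you'd rather kick off a job directly from Python instead of the CLI, here is a minimal sketch using the `openai` SDK (pre-1.0 style, matching the `openai.ChatCompletion.create` usage mentioned above); the file name comes from the log handler in the example, but the base model choice is an assumption:

```python
import openai

# Upload the dataset produced by the logging handler above.
file = openai.File.create(
    file=open("math_finetunes.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job (base model here is an assumption, not prescribed by instructor).
job = openai.FineTuningJob.create(
    training_file=file["id"],
    model="gpt-3.5-turbo",
)
print(job["id"])
```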

## Next Steps

This post is mostly a peek at what I've been working on this week. Once we have a trained model, I'd like to be able to dynamically swap the implementation of a function with the model. This would allow us to do things like:

```python
from instructor import Instructions

instructions = Instructions(
    name="three_digit_multiply",
)

@instructions.distil(model='gpt-3.5-turbo:finetuned', swap=True)
def fn(a: int, b: int) -> Multiply:
    resp = a * b
    return Multiply(a=a, b=b, result=resp)
```

Now we can swap out the implementation of `fn` with a call to the finetuned model. Since we know the response type is still `Multiply`, we can use instructor behind the scenes and keep it backwards compatible with the existing code.
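
Conceptually, the swapped-in call is just a patched chat completion with `response_model` set to the function's return type. Here is a rough sketch of what that might look like, not the actual implementation; the patching call, prompt wording, and model name are assumptions:

```python
import instructor
import openai

# Pre-1.0 instructor style: patch openai so ChatCompletion.create accepts
# response_model (exact patching API may differ by version).
instructor.patch()

def fn_swapped(a: int, b: int) -> Multiply:
    # Ask the finetuned model to predict the function's output,
    # parsed and validated straight into the Multiply model.
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo:finetuned",  # placeholder finetuned model name
        response_model=Multiply,
        messages=[
            {"role": "system", "content": "Predict the results of this function: fn(a: int, b: int) -> Multiply"},
            {"role": "user", "content": f"Return fn({a}, b={b})"},
        ],
    )
```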

Now, if you're thinking "wow, I'd love a backend service to do this continuously", you're in luck! Please check out the survey at [useinstructor.com](https://useinstructor.com) and let us know who you are.
93 changes: 93 additions & 0 deletions docs/distilation.md
@@ -0,0 +1,93 @@
# Distilling Python functions into LLMs

`Instructions` from the `Instructor` library offers a seamless way to make language models backward compatible with existing Python functions. By employing Pydantic type hints, it not only ensures compatibility but also facilitates fine-tuning language models to emulate these functions end-to-end.

## The Challenges in Function-Level Fine-Tuning

Unlike simple script-level fine-tuning, replicating the behavior of a Python function in a language model involves intricate data preparation. For instance, teaching a model to execute three-digit multiplication is not as trivial as implementing `def f(a, b): return a * b`. OpenAI's fine-tuning script coupled with their function calling utility provides a structured output, thereby simplifying the data collection process. Additionally, this eliminates the need for passing the schema to the model, thus conserving tokens.

## The Role of `Instructions` in Simplifying the Fine-Tuning Process

By using `Instructions`, you can annotate a Python function that returns a Pydantic object, thereby automating the dataset creation for fine-tuning. A handler for logging is all that's needed to build this dataset.

## How to Implement `Instructions` in Your Code

Here's a step-by-step example:

```python
import logging
from pydantic import BaseModel
from instructor import Instructions

logging.basicConfig(level=logging.INFO)

instructions = Instructions(
    name="three_digit_multiply",
    finetune_format="messages",
    log_handlers=[logging.FileHandler("math_finetunes.jsonl")]
)

class Multiply(BaseModel):
    a: int
    b: int
    result: int

@instructions.distil
def fn(a: int, b: int) -> Multiply:
    resp = a * b
    return Multiply(a=a, b=b, result=resp)
```
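
Each call to the decorated function now logs a fine-tuning example. A small driver loop like the one below (mirroring the blog post example) is enough to start building the dataset:

```python
import random

# Every call is captured by the log handler as a fine-tuning example.
for _ in range(10):
    a = random.randint(100, 999)
    b = random.randint(100, 999)
    print(fn(a, b))
```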

## Custom Log Handlers for Data Collection

While the example above uses a file-based log handler, you can easily extend this to custom log handlers for different storage solutions. The following skeleton code illustrates how to create a log handler for an S3 bucket:

```python
import logging
import boto3

class S3LogHandler(logging.Handler):
    def __init__(self, bucket, key):
        logging.Handler.__init__(self)
        self.bucket = bucket
        self.key = key

    def emit(self, record):
        s3 = boto3.client('s3')
        log_entry = self.format(record)
        # Write the formatted record to S3 (note: this overwrites the object at `key` on each emit).
        s3.put_object(Body=log_entry, Bucket=self.bucket, Key=self.key)
```

You can add this custom log handler to `Instructions` as shown:

```python
instructions = Instructions(
    name="three_digit_multiply",
    finetune_format="messages",
    log_handlers=[S3LogHandler(bucket='your-bucket', key='your-key')]
)
```

## Why `Instructions` is a Game-Changer

1. It condenses complex, multi-step functions with validations into a single fine-tuned model.
2. It integrates language models with classical machine learning seamlessly.

## Next Steps and Future Scope

Going forward, the aim is to dynamically switch between the Python function and its fine-tuned model representation. This could look like:

```python
from instructor import Instructions

instructions = Instructions(
    name="three_digit_multiply",
)

@instructions.distil(model='gpt-3.5-turbo:finetuned', swap=True)
def fn(a: int, b: int) -> Multiply:
    resp = a * b
    return Multiply(a=a, b=b, result=resp)
```

This dynamic switching retains backward compatibility while improving efficiency, opening up exciting avenues for future developments.
10 changes: 10 additions & 0 deletions examples/distilations/math_finetunes.jsonl
@@ -0,0 +1,10 @@
{"messages": [{"role": "system", "content": "Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Multiply\n\"\"\"\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n\"\"\""}, {"role": "user", "content": "Return `fn(540, b=677, c=\"hello\")`"}, {"role": "assistant", "function_call": {"name": "Multiply", "arguments": "{\n \"a\": 540,\n \"b\": 677,\n \"result\": 1217\n}"}}], "functions": [{"name": "Multiply", "description": "Correctly extracted `Multiply` with all the required parameters with correct types", "parameters": {"properties": {"a": {"type": "integer"}, "b": {"type": "integer"}, "result": {"description": "The result of the multiplication", "type": "integer"}}, "required": ["a", "b", "result"], "type": "object"}}]}
{"messages": [{"role": "system", "content": "Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Multiply\n\"\"\"\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n\"\"\""}, {"role": "user", "content": "Return `fn(798, b=534, c=\"hello\")`"}, {"role": "assistant", "function_call": {"name": "Multiply", "arguments": "{\n \"a\": 798,\n \"b\": 534,\n \"result\": 1332\n}"}}], "functions": [{"name": "Multiply", "description": "Correctly extracted `Multiply` with all the required parameters with correct types", "parameters": {"properties": {"a": {"type": "integer"}, "b": {"type": "integer"}, "result": {"description": "The result of the multiplication", "type": "integer"}}, "required": ["a", "b", "result"], "type": "object"}}]}
{"messages": [{"role": "system", "content": "Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Multiply\n\"\"\"\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n\"\"\""}, {"role": "user", "content": "Return `fn(608, b=669, c=\"hello\")`"}, {"role": "assistant", "function_call": {"name": "Multiply", "arguments": "{\n \"a\": 608,\n \"b\": 669,\n \"result\": 1277\n}"}}], "functions": [{"name": "Multiply", "description": "Correctly extracted `Multiply` with all the required parameters with correct types", "parameters": {"properties": {"a": {"type": "integer"}, "b": {"type": "integer"}, "result": {"description": "The result of the multiplication", "type": "integer"}}, "required": ["a", "b", "result"], "type": "object"}}]}
{"messages": [{"role": "system", "content": "Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Multiply\n\"\"\"\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n\"\"\""}, {"role": "user", "content": "Return `fn(982, b=768, c=\"hello\")`"}, {"role": "assistant", "function_call": {"name": "Multiply", "arguments": "{\n \"a\": 982,\n \"b\": 768,\n \"result\": 1750\n}"}}], "functions": [{"name": "Multiply", "description": "Correctly extracted `Multiply` with all the required parameters with correct types", "parameters": {"properties": {"a": {"type": "integer"}, "b": {"type": "integer"}, "result": {"description": "The result of the multiplication", "type": "integer"}}, "required": ["a", "b", "result"], "type": "object"}}]}
{"messages": [{"role": "system", "content": "Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Multiply\n\"\"\"\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n\"\"\""}, {"role": "user", "content": "Return `fn(994, b=682, c=\"hello\")`"}, {"role": "assistant", "function_call": {"name": "Multiply", "arguments": "{\n \"a\": 994,\n \"b\": 682,\n \"result\": 1676\n}"}}], "functions": [{"name": "Multiply", "description": "Correctly extracted `Multiply` with all the required parameters with correct types", "parameters": {"properties": {"a": {"type": "integer"}, "b": {"type": "integer"}, "result": {"description": "The result of the multiplication", "type": "integer"}}, "required": ["a", "b", "result"], "type": "object"}}]}
{"messages": [{"role": "system", "content": "Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Multiply\n\"\"\"\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n\"\"\""}, {"role": "user", "content": "Return `fn(467, b=754, c=\"hello\")`"}, {"role": "assistant", "function_call": {"name": "Multiply", "arguments": "{\n \"a\": 467,\n \"b\": 754,\n \"result\": 1221\n}"}}], "functions": [{"name": "Multiply", "description": "Correctly extracted `Multiply` with all the required parameters with correct types", "parameters": {"properties": {"a": {"type": "integer"}, "b": {"type": "integer"}, "result": {"description": "The result of the multiplication", "type": "integer"}}, "required": ["a", "b", "result"], "type": "object"}}]}
{"messages": [{"role": "system", "content": "Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Multiply\n\"\"\"\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n\"\"\""}, {"role": "user", "content": "Return `fn(497, b=364, c=\"hello\")`"}, {"role": "assistant", "function_call": {"name": "Multiply", "arguments": "{\n \"a\": 497,\n \"b\": 364,\n \"result\": 861\n}"}}], "functions": [{"name": "Multiply", "description": "Correctly extracted `Multiply` with all the required parameters with correct types", "parameters": {"properties": {"a": {"type": "integer"}, "b": {"type": "integer"}, "result": {"description": "The result of the multiplication", "type": "integer"}}, "required": ["a", "b", "result"], "type": "object"}}]}
{"messages": [{"role": "system", "content": "Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Multiply\n\"\"\"\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n\"\"\""}, {"role": "user", "content": "Return `fn(840, b=821, c=\"hello\")`"}, {"role": "assistant", "function_call": {"name": "Multiply", "arguments": "{\n \"a\": 840,\n \"b\": 821,\n \"result\": 1661\n}"}}], "functions": [{"name": "Multiply", "description": "Correctly extracted `Multiply` with all the required parameters with correct types", "parameters": {"properties": {"a": {"type": "integer"}, "b": {"type": "integer"}, "result": {"description": "The result of the multiplication", "type": "integer"}}, "required": ["a", "b", "result"], "type": "object"}}]}
{"messages": [{"role": "system", "content": "Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Multiply\n\"\"\"\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n\"\"\""}, {"role": "user", "content": "Return `fn(646, b=835, c=\"hello\")`"}, {"role": "assistant", "function_call": {"name": "Multiply", "arguments": "{\n \"a\": 646,\n \"b\": 835,\n \"result\": 1481\n}"}}], "functions": [{"name": "Multiply", "description": "Correctly extracted `Multiply` with all the required parameters with correct types", "parameters": {"properties": {"a": {"type": "integer"}, "b": {"type": "integer"}, "result": {"description": "The result of the multiplication", "type": "integer"}}, "required": ["a", "b", "result"], "type": "object"}}]}
{"messages": [{"role": "system", "content": "Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Multiply\n\"\"\"\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n\"\"\""}, {"role": "user", "content": "Return `fn(926, b=196, c=\"hello\")`"}, {"role": "assistant", "function_call": {"name": "Multiply", "arguments": "{\n \"a\": 926,\n \"b\": 196,\n \"result\": 1122\n}"}}], "functions": [{"name": "Multiply", "description": "Correctly extracted `Multiply` with all the required parameters with correct types", "parameters": {"properties": {"a": {"type": "integer"}, "b": {"type": "integer"}, "result": {"description": "The result of the multiplication", "type": "integer"}}, "required": ["a", "b", "result"], "type": "object"}}]}
79 changes: 79 additions & 0 deletions examples/distilations/three_digit_mul.py
@@ -0,0 +1,79 @@
import logging

from pydantic import BaseModel, Field
from instructor import Instructions

logging.basicConfig(level=logging.INFO)

# Usage
instructions = Instructions(
    name="three_digit_multiply",
    finetune_format="messages",
    log_handlers=[
        logging.FileHandler("math_finetunes.jsonl"),
    ],
)


class Multiply(BaseModel):
    a: int
    b: int
    result: int = Field(..., description="The result of the multiplication")


@instructions.distil
def fn(a: int, b: int, c: str) -> Multiply:
    """_summary_
    Args:
        a (int): _description_
        b (int): _description_
        c (str): _description_
    Returns:
        Response: _description_
    """
    resp = a + b
    return Multiply(a=a, b=b, result=resp)


if __name__ == "__main__":
    import random

    # A log will look like this:
    log_line = {
        "messages": [
            {
                "role": "system",
                "content": 'Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Response\n"""\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n"""',
            },
            {"role": "user", "content": 'Return fn(133, b=539, c="hello")'},
            {
                "role": "assistant",
                "function_call": {
                    "name": "Response",
                    "arguments": '{"a":133,"b":539,"result":672}',
                },
            },
        ],
        "functions": [
            {
                "name": "Response",
                "description": "Correctly extracted `Response` with all the required parameters with correct types",
                "parameters": {
                    "properties": {
                        "a": {"type": "integer"},
                        "b": {"type": "integer"},
                        "result": {"type": "integer"},
                    },
                    "required": ["a", "b", "result"],
                    "type": "object",
                },
            }
        ],
    }

    for _ in range(10):
        a = random.randint(100, 999)
        b = random.randint(100, 999)
        print("returning", fn(a, b=b, c="hello"))
3 changes: 3 additions & 0 deletions instructor/__init__.py
@@ -1,4 +1,5 @@
from .function_calls import OpenAISchema, openai_function, openai_schema
from .distil import FinetuneFormat, Instructions
from .dsl import MultiTask, Maybe, llm_validator, CitationMixin
from .patch import patch, unpatch

@@ -11,5 +12,7 @@
"openai_schema",
"patch",
"llm_validator",
"FinetuneFormat",
"Instructions",
"unpatch",
]