-
-
Notifications
You must be signed in to change notification settings - Fork 692
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
193 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,193 @@ | ||
--- | ||
draft: False | ||
date: 2024-03-08 | ||
authors: | ||
- jxnl | ||
--- | ||
|
||
# Simple Synthetic Data Generation | ||
|
||
What that people have been using instructor for is to generate synthetic data rather than extracting data itself. We can even use the J-Schemo extra fields to give specific examples to control how we generate data. | ||
|
||
Consider the example below. We'll likely generate very simple names. | ||
|
||
```python | ||
from typing import Iterable | ||
from pydantic import BaseModel | ||
import instructor | ||
from openai import OpenAI | ||
|
||
|
||
# Define the UserDetail model | ||
class UserDetail(BaseModel): | ||
name: str | ||
age: int | ||
|
||
|
||
# Patch the OpenAI client to enable the response_model functionality | ||
client = instructor.patch(OpenAI()) | ||
|
||
|
||
def generate_fake_users(count: int) -> Iterable[UserDetail]: | ||
return client.chat.completions.create( | ||
model="gpt-3.5-turbo", | ||
response_model=Iterable[UserDetail], | ||
messages=[ | ||
{"role": "user", "content": f"Generate a {count} synthetic users"}, | ||
], | ||
) | ||
|
||
|
||
for user in generate_fake_users(5): | ||
print(user) | ||
""" | ||
name='Alice' age=25 | ||
name='Bob' age=30 | ||
name='Charlie' age=35 | ||
name='David' age=40 | ||
name='Eve' age=45 | ||
""" | ||
``` | ||
|
||
## Leveraging Simple Examples | ||
|
||
We might want to set examples as part of the prompt by leveraging Pydantics configuration. We can set examples directly in the JSON scheme itself. | ||
|
||
```python | ||
from typing import Iterable | ||
from pydantic import BaseModel | ||
import instructor | ||
from openai import OpenAI | ||
|
||
|
||
# Define the UserDetail model | ||
class UserDetail(BaseModel): | ||
name: str = Field(examples=["Timothee Chalamet", "Zendaya"]) | ||
age: int | ||
|
||
|
||
# Patch the OpenAI client to enable the response_model functionality | ||
client = instructor.patch(OpenAI()) | ||
|
||
|
||
def generate_fake_users(count: int) -> Iterable[UserDetail]: | ||
return client.chat.completions.create( | ||
model="gpt-3.5-turbo", | ||
response_model=Iterable[UserDetail], | ||
messages=[ | ||
{"role": "user", "content": f"Generate a {count} synthetic users"}, | ||
], | ||
) | ||
|
||
|
||
for user in generate_fake_users(5): | ||
print(user) | ||
""" | ||
name='Timothee Chalamet' age=25 | ||
name='Zendaya' age=24 | ||
name='Keanu Reeves' age=56 | ||
name='Scarlett Johansson' age=36 | ||
name='Chris Hemsworth' age=37 | ||
""" | ||
``` | ||
|
||
By incorporating names of celebrities as examples, we have shifted towards generating synthetic data featuring well-known personalities, moving away from the simplistic, single-word names previously used. | ||
|
||
## Leveraging Complex Example | ||
|
||
To effectively generate synthetic examples with more nuance, lets upgrade to the "gpt-4-turbo-preview" model, use model level examples rather than attribute level examples: | ||
|
||
```Python | ||
import instructor | ||
|
||
from typing import Iterable | ||
from pydantic import BaseModel, Field, ConfigDict | ||
from openai import OpenAI | ||
|
||
|
||
# Define the UserDetail model | ||
class UserDetail(BaseModel): | ||
"""Old Wizards""" | ||
name: str | ||
age: int | ||
|
||
model_config = ConfigDict( | ||
json_schema_extra={ | ||
"examples": [ | ||
{"name": "Gandalf the Grey", "age": 1000}, | ||
{"name": "Albus Dumbledore", "age": 150}, | ||
] | ||
} | ||
) | ||
|
||
|
||
# Patch the OpenAI client to enable the response_model functionality | ||
client = instructor.patch(OpenAI()) | ||
|
||
|
||
def generate_fake_users(count: int) -> Iterable[UserDetail]: | ||
return client.chat.completions.create( | ||
model="gpt-4-turbo-preview", | ||
response_model=Iterable[UserDetail], | ||
messages=[ | ||
{"role": "user", "content": f"Generate `{count}` synthetic examples"}, | ||
], | ||
) | ||
|
||
|
||
for user in generate_fake_users(5): | ||
print(user) | ||
""" | ||
name='Merlin' age=196 | ||
name='Saruman the White' age=543 | ||
name='Radagast the Brown' age=89 | ||
name='Morgoth' age=901 | ||
name='Filius Flitwick' age=105 | ||
""" | ||
``` | ||
|
||
## Leveraging Descriptions | ||
|
||
By adjusting the descriptions within our Pydantic models, we can subtly influence the nature of the synthetic data generated. This method allows for a more nuanced control over the output, ensuring that the generated data aligns more closely with our expectations or requirements. | ||
|
||
For instance, specifying "Fancy French sounding names" as a description for the `name` field in our `UserDetail` model directs the generation process to produce names that fit this particular criterion, resulting in a dataset that is both diverse and tailored to specific linguistic characteristics. | ||
|
||
|
||
```python | ||
import instructor | ||
|
||
from typing import Iterable | ||
from pydantic import BaseModel, Field | ||
from openai import OpenAI | ||
|
||
|
||
# Define the UserDetail model | ||
class UserDetail(BaseModel): | ||
name: str = Field(description="Fancy French sounding names") | ||
age: int | ||
|
||
|
||
# Patch the OpenAI client to enable the response_model functionality | ||
client = instructor.patch(OpenAI()) | ||
|
||
|
||
def generate_fake_users(count: int) -> Iterable[UserDetail]: | ||
return client.chat.completions.create( | ||
model="gpt-3.5-turbo", | ||
response_model=Iterable[UserDetail], | ||
messages=[ | ||
{"role": "user", "content": f"Generate `{count}` synthetic users"}, | ||
], | ||
) | ||
|
||
|
||
for user in generate_fake_users(5): | ||
print(user) | ||
""" | ||
name='Jean' age=25 | ||
name='Claire' age=30 | ||
name='Pierre' age=22 | ||
name='Marie' age=27 | ||
name='Luc' age=35 | ||
""" | ||
``` |