forked from instructor-ai/instructor
Showing 6 changed files with 2,245 additions and 87 deletions.
@@ -0,0 +1,35 @@
# Concepts

Welcome to the Concepts section of the Instructor documentation. This section provides an in-depth exploration of the key ideas and techniques that form the foundation of working with structured outputs in AI applications using Instructor.

## Introduction

Instructor is designed to simplify the process of extracting structured data from large language models (LLMs). By leveraging the power of Pydantic and OpenAI's function-calling API, Instructor enables developers to create robust, type-safe applications that efficiently process and validate AI-generated outputs.
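
To ground that description, here is a minimal sketch of the core workflow. It assumes the v1-style `instructor.from_openai` entry point and an `OPENAI_API_KEY` in the environment; treat it as an illustration rather than the canonical quickstart:

```python
# Minimal sketch: extract a typed object from an LLM completion.
# Assumes instructor v1+ (instructor.from_openai) and OPENAI_API_KEY set.
import instructor
from openai import OpenAI
from pydantic import BaseModel


class UserInfo(BaseModel):
    name: str
    age: int


client = instructor.from_openai(OpenAI())

# response_model tells Instructor to validate the completion against UserInfo.
user = client.chat.completions.create(
    model="gpt-4o",
    response_model=UserInfo,
    messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)
print(user)  # e.g. UserInfo(name='John Doe', age=30)
```

Because the result is a validated Pydantic object rather than raw JSON, downstream code gets type safety for free.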
In this section, we'll cover a range of concepts that are crucial for understanding and effectively using Instructor. Whether you're new to the library or looking to deepen your knowledge, these guides provide valuable insights into the core principles and advanced features of Instructor.

## Key Concepts

Here's an overview of the concepts we'll explore (a short sketch combining a few of them follows the list):

1. [Aliases](alias.md): Learn how to use aliases to customize field names in your Pydantic models.
2. [Caching](caching.md): Discover techniques for improving performance through effective data management and caching strategies.
3. [Templating](templating.md): Explore Jinja templating for dynamic and efficient prompt management.
4. [Type Adapter](typeadapter.md): Understand Pydantic's TypeAdapter for enhanced data validation and parsing.
5. [TypedDicts](typeddicts.md): Learn about using TypedDicts for structured data handling with OpenAI's API.
6. [Types](types.md): Dive into the various data types supported by Instructor, from simple to complex.
7. [Union](union.md): Explore the use of Union types for flexible and dynamic operations in your models.
8. [Usage](usage.md): Get insights on handling non-streaming requests and managing token usage with the OpenAI API.
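
To make a couple of these concrete before you dive in, here is a small, self-contained sketch (plain Pydantic, no API calls) of the alias and Union ideas from items 1 and 7; the model names are hypothetical:

```python
from typing import Union

from pydantic import BaseModel, Field


class Search(BaseModel):
    # The alias lets the LLM emit "q" while your code uses "query".
    query: str = Field(..., alias="q")


class Calculator(BaseModel):
    expression: str


class Action(BaseModel):
    # A Union lets one response model cover several tool shapes;
    # Pydantic validates against whichever variant fits the data.
    tool: Union[Search, Calculator]


action = Action.model_validate({"tool": {"q": "pydantic aliases"}})
print(action.tool)  # Search(query='pydantic aliases')
```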
Each of these concepts plays a crucial role in building efficient, type-safe, and robust applications with Instructor. By mastering these ideas, you'll be well equipped to tackle complex data extraction and validation tasks in your AI-powered projects.

We encourage you to explore these concepts in depth and see how they can be applied to your specific use cases. Remember, the power of Instructor lies in its ability to combine these concepts seamlessly, allowing you to create sophisticated applications with ease.

Happy learning, and enjoy your journey through the world of structured outputs with Instructor!
@@ -0,0 +1,176 @@
```python
import os
import asyncio
import yaml
from typing import Generator, Tuple, Dict, Optional, List
from openai import AsyncOpenAI
import typer
from rich.console import Console
from rich.progress import Progress
import hashlib
from asyncio import as_completed
import tenacity

console = Console()


def traverse_docs(
    root_dir: str = "docs",
) -> Generator[Tuple[str, str, str], None, None]:
    """
    Recursively traverse the docs folder and yield the path, content, and content hash of each file.

    Args:
        root_dir (str): The root directory to start traversing from. Defaults to 'docs'.

    Yields:
        Tuple[str, str, str]: A tuple containing the relative path from 'docs', the file content, and the content hash.
    """
    for root, _, files in os.walk(root_dir):
        for file in files:
            if file.endswith(".md"):  # Assuming we're only interested in Markdown files
                file_path = os.path.join(root, file)
                relative_path = os.path.relpath(file_path, root_dir)

                with open(file_path, "r", encoding="utf-8") as f:
                    content = f.read()

                content_hash = hashlib.md5(content.encode()).hexdigest()
                yield relative_path, content, content_hash


@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_exponential(multiplier=1, min=4, max=10),
    retry=tenacity.retry_if_exception_type(Exception),
    before_sleep=lambda retry_state: console.print(
        f"[yellow]Retrying summarization... (Attempt {retry_state.attempt_number})[/yellow]"
    ),
)
async def summarize_content(client: AsyncOpenAI, path: str, content: str) -> str:
    """
    Summarize the content of a file with retry logic.

    Args:
        client (AsyncOpenAI): The AsyncOpenAI client.
        path (str): The path of the file.
        content (str): The content of the file.

    Returns:
        str: A summary of the content.

    Raises:
        Exception: If all retry attempts fail.
    """
    try:
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant that summarizes text.",
                },
                {"role": "user", "content": content},
                {
                    "role": "user",
                    "content": "Please summarize the content in a few sentences so they can be used for SEO. Include core ideas, objectives, important details, key points, and keywords.",
                },
            ],
            max_tokens=4000,
        )
        return response.choices[0].message.content
    except Exception as e:
        console.print(f"[bold red]Error summarizing {path}: {str(e)}[/bold red]")
        raise  # Re-raise the exception to trigger a retry


async def generate_sitemap(
    root_dir: str,
    output_file: str,
    api_key: Optional[str] = None,
    max_concurrency: int = 5,
) -> None:
    """
    Generate a sitemap from the given root directory.

    Args:
        root_dir (str): The root directory to start traversing from.
        output_file (str): The output file to save the sitemap.
        api_key (Optional[str]): The OpenAI API key. If not provided, it will be read from the OPENAI_API_KEY environment variable.
        max_concurrency (int): The maximum number of concurrent tasks. Defaults to 5.
    """
    client = AsyncOpenAI(api_key=api_key)

    # Load the existing sitemap if it exists, so unchanged files can be skipped
    existing_sitemap: Dict[str, Dict[str, str]] = {}
    if os.path.exists(output_file):
        with open(output_file, "r", encoding="utf-8") as sitemap_file:
            existing_sitemap = yaml.safe_load(sitemap_file) or {}

    sitemap_data: Dict[str, Dict[str, str]] = {}

    async def process_file(
        path: str, content: str, content_hash: str
    ) -> Tuple[str, Dict[str, str]]:
        if (
            path in existing_sitemap
            and existing_sitemap[path].get("hash") == content_hash
        ):
            return path, existing_sitemap[path]
        try:
            summary = await summarize_content(client, path, content)
            return path, {"summary": summary, "hash": content_hash}
        except Exception as e:
            console.print(
                f"[bold red]Failed to summarize {path} after multiple attempts: {str(e)}[/bold red]"
            )
            return path, {"summary": "Failed to generate summary", "hash": content_hash}

    files_to_process: List[Tuple[str, str, str]] = list(traverse_docs(root_dir))
    total_files = len(files_to_process)

    with Progress() as progress:
        task = progress.add_task("[green]Processing files...", total=total_files)

        semaphore = asyncio.Semaphore(max_concurrency)

        async def bounded_process_file(*args):
            async with semaphore:
                return await process_file(*args)

        tasks = [
            bounded_process_file(path, content, content_hash)
            for path, content, content_hash in files_to_process
        ]

        for completed_task in as_completed(tasks):
            path, result = await completed_task
            sitemap_data[path] = result
            progress.update(task, advance=1)

            # Save intermediate results after each file
            with open(output_file, "w", encoding="utf-8") as sitemap_file:
                yaml.dump(sitemap_data, sitemap_file, default_flow_style=False)

    console.print(
        f"[bold green]Sitemap has been generated and saved to {output_file}[/bold green]"
    )


app = typer.Typer()


@app.command()
def main(
    root_dir: str = typer.Option("docs", help="Root directory to traverse"),
    output_file: str = typer.Option("sitemap.yaml", help="Output file for the sitemap"),
    api_key: Optional[str] = typer.Option(None, help="OpenAI API key"),
    max_concurrency: int = typer.Option(5, help="Maximum number of concurrent tasks"),
):
    """
    Generate a sitemap from the given root directory.
    """
    asyncio.run(generate_sitemap(root_dir, output_file, api_key, max_concurrency))


if __name__ == "__main__":
    app()
```
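
Assuming the script is saved as `make_sitemap.py` at the repository root (the diff does not show the file name, so that name is a guess), Typer exposes the options as a CLI, e.g. `python make_sitemap.py --root-dir docs --output-file sitemap.yaml --max-concurrency 5`. The output maps each relative Markdown path to its summary and content hash, so downstream tooling can reload it. A minimal consumer sketch:

```python
# Minimal consumer sketch for the generated sitemap.
# The {path: {"summary": ..., "hash": ...}} shape mirrors what
# process_file() above writes; the file name is an assumption.
import yaml

with open("sitemap.yaml", "r", encoding="utf-8") as f:
    sitemap = yaml.safe_load(f) or {}

for path, entry in sitemap.items():
    print(f"{path}: {entry['summary'][:80]}")
```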