improve docs

raghavpillai · Oct 4, 2024 · 67dbe27
1 parent 1827132
Showing 6 changed files with 2,245 additions and 87 deletions.
diff --git a/docs/blog/index.md b/docs/blog/index.md
@@ -6,26 +6,40 @@ If you want to get updates on new features and tips on how to use Instructor, yo
 
 ## Advanced Topics
 
-1. [What is Query Understanding, how does it go beyond embeddings?](posts/rag-and-beyond.md)
-2. [How can one achieve GPT-4 level summaries using GPT-3.5-turbo?](posts/chain-of-density.md)
-3. [What are the basics of Guardrails and Validation in AI models?](posts/validation-part1.md)
-4. [How does one validate citations in AI-generated content?](posts/citations.md)
-5. [What are the methods and benefits of fine-tuning and distillation in AI models?](posts/distilation-part1.md)
-
-## Learning Python
-
-- [How can I effectively cache my functions in Python?](posts/caching.md)
-- [What are the fundamentals of batch processing with async in Python?](posts/learn-async.md)
-- [How can I stream models to improve latency?](posts/generator.md)
-
-## Integrations
-
-- [Ollama](./../hub/ollama.md)
-- [llama-cpp-python](./../hub/llama-cpp-python.md)
-- [Anyscale](./../hub/anyscale.md)
-- [Together Compute](./../hub/together.md)
-
-## Media
-
-- [Course: Structured Outputs w/ Instructor](https://www.wandb.courses/courses/steering-language-models?x=1)
-- [Keynote: Pydantic is all you need](posts/aisummit-2023.md)
+1. [Query Understanding: Beyond Embeddings](posts/rag-and-beyond.md)
+2. [Achieving GPT-4 Level Summaries with GPT-3.5-turbo](posts/chain-of-density.md)
+3. [Basics of Guardrails and Validation in AI Models](posts/validation-part1.md)
+4. [Validating Citations in AI-Generated Content](posts/citations.md)
+5. [Fine-tuning and Distillation in AI Models](posts/distilation-part1.md)
+6. [Enhancing OpenAI Client Observability with LangSmith](posts/langsmith.md)
+7. [Logfire Integration with Pydantic](posts/logfire.md)
+
+## AI Development and Optimization
+
+- [Effective Function Caching in Python](posts/caching.md)
+- [Fundamentals of Batch Processing with Async in Python](posts/learn-async.md)
+- [Streaming Models to Improve Latency](posts/generator.md)
+- [Using OpenAI's Batch API for Large-Scale Synthetic Data Generation](../examples/batch_job_oai.md)
+- [Implementing Bulk Classification with User-Provided Tags](../examples/bulk_classification.md)
+- [Utilizing GPT-4 Vision API for Ad Copy from Product Images](../examples/image_to_ad_copy.md)
+
+## Language Models and Prompting Techniques
+
+- [Least-to-Most Prompting Technique for LLMs](../prompting/decomposition/least_to_most.md)
+- [Chain of Verification (CoVe) Method for Improving LLM Accuracy](../prompting/self_criticism/chain_of_verification.md)
+- [Cumulative Reasoning to Enhance Model Performance](../prompting/self_criticism/cumulative_reason.md)
+- [Reverse Chain of Thought (RCoT) Method for Logical Consistency](../prompting/self_criticism/reversecot.md)
+
+## Integrations and Tools
+
+- [Ollama Integration](../hub/ollama.md)
+- [llama-cpp-python Integration](../hub/llama-cpp-python.md)
+- [Anyscale Integration](../hub/anyscale.md)
+- [Together Compute Integration](../hub/together.md)
+- [Extracting Data into Pandas DataFrame using GPT-3.5 Turbo](../hub/pandas_df.md)
+- [Implementing Streaming Partial Responses with Field-Level Streaming](../hub/partial_streaming.md)
+
+## Media and Resources
+
+- [Course: Structured Outputs with Instructor](https://www.wandb.courses/courses/steering-language-models?x=1)
+- [Keynote: Pydantic is All You Need](posts/aisummit-2023.md)
diff --git a/docs/concepts/index.md b/docs/concepts/index.md
@@ -0,0 +1,35 @@
+# Concepts
+
+Welcome to the Concepts section of Instructor documentation. This section provides an in-depth exploration of key ideas and techniques that form the foundation of working with structured outputs in AI applications using Instructor.
+
+## Introduction
+
+Instructor is designed to simplify the process of extracting structured data from large language models (LLMs). By leveraging the power of Pydantic and OpenAI's function calling API, Instructor enables developers to create robust, type-safe applications that can efficiently process and validate AI-generated outputs.
+
+In this section, we'll cover a range of concepts that are crucial for understanding and effectively using Instructor. Whether you're new to the library or looking to deepen your knowledge, these guides will provide valuable insights into the core principles and advanced features of Instructor.
+
+## Key Concepts
+
+Here's an overview of the concepts we'll explore:
+
+1. [Aliases](alias.md): Learn how to use aliases to customize field names in your Pydantic models.
+
+2. [Caching](caching.md): Discover caching strategies for improving performance.
+
+3. [Templating](templating.md): Explore Jinja templating for dynamic and efficient prompt management.
+
+4. [Type Adapter](typeadapter.md): Understand Pydantic's Type Adapter for enhanced data validation and parsing.
+
+5. [TypedDicts](typeddicts.md): Learn about using TypedDicts for structured data handling with OpenAI's API.
+
+6. [Types](types.md): Dive into the various data types supported by Instructor, from simple to complex.
+
+7. [Union](union.md): Explore the use of Union types for flexible and dynamic operations in your models.
+
+8. [Usage](usage.md): Get insights on handling non-streaming requests and managing token usage with the OpenAI API.
+
+Each of these concepts plays a crucial role in building efficient, type-safe, and robust applications with Instructor. By mastering these ideas, you'll be well-equipped to tackle complex data extraction and validation tasks in your AI-powered projects.
+
+We encourage you to explore these concepts in depth and see how they can be applied to your specific use cases. Remember, the power of Instructor lies in its ability to combine these concepts seamlessly, allowing you to create sophisticated applications with ease.
+
+Happy learning, and enjoy your journey through the world of structured outputs with Instructor!
diff --git a/docs/examples/index.md b/docs/examples/index.md
@@ -4,26 +4,32 @@ Welcome to our collection of cookbooks showcasing the power of structured output
 
 ## Quick Links
 
-1. [Classifying using enums](classification.md)
-2. [Implementing AI self-assessment](self_critique.md)
-3. [Classifying in batch](batch_classification.md)
-4. [Retrieving exact citations](exact_citations.md)
-5. [Segmenting search queries](search.md)
-6. [Generating knowledge graphs](knowledge_graph.md)
-7. [Decomposing complex queries](planning-tasks.md)
-8. [Extracting and resolving entities](entity_resolution.md)
-9. [Sanitizing Personally Identifiable Information](pii.md)
-10. [Generating action items and dependencies](../hub/action_items.md)
-11. [Enabling OpenAI's moderation](moderation.md)
-12. [Extracting tables using GPT-Vision](extracting_tables.md)
-13. [Generating advertising copy from images](image_to_ad_copy.md)
-14. [Using local models from Ollama](ollama.md)
-15. [Storing responses in a database](sqlmodel.md)
-16. [Segmenting documents using LLMs](document_segmentation.md)
-17. [Saving API costs with OpenAI's Batch API](batch_job_oai.md)
-18. [Using groqcloud api](groq.md)
-19. [Using Mistral/Mixtral](mistral.md)
-20. [Working with Multi-Modal data with Gemini](multi_modal_gemini.md)
+1. [Enum-Based Classification](classification.md): Implement structured classification using Python enums with AI models.
+2. [AI Self-Assessment and Correction](self_critique.md): Explore techniques for AI models to evaluate and improve their own outputs.
+3. [Efficient Batch Classification](batch_classification.md): Process multiple items simultaneously for improved performance.
+4. [Precise Citation Extraction](exact_citations.md): Accurately retrieve and format citations from text using AI.
+5. [Search Query Segmentation](search.md): Break down complex search queries into structured components for better understanding.
+6. [Dynamic Knowledge Graph Generation](knowledge_graph.md): Create visual representations of information relationships using AI.
+7. [Complex Query Decomposition](planning-tasks.md): Break down intricate queries into manageable subtasks for thorough analysis.
+8. [Entity Extraction and Resolution](entity_resolution.md): Identify and disambiguate named entities in text.
+9. [PII Sanitization](pii.md): Detect and redact sensitive personal information from text data.
+10. [Action Item and Dependency Extraction](../hub/action_items.md): Generate structured task lists and relationships from meeting transcripts.
+11. [OpenAI Content Moderation Integration](moderation.md): Implement content filtering using OpenAI's moderation API.
+12. [Table Extraction with GPT-Vision](extracting_tables.md): Convert image-based tables into structured data using AI vision capabilities.
+13. [AI-Powered Ad Copy Generation from Images](image_to_ad_copy.md): Create compelling advertising text based on visual content.
+14. [Local AI with Ollama Integration](ollama.md): Utilize open-source language models for on-device processing.
+15. [Database Integration with SQLModel](sqlmodel.md): Seamlessly store AI-generated responses in SQL databases.
+16. [LLM-Based Document Segmentation](document_segmentation.md): Intelligently divide long documents into meaningful sections.
+17. [Cost Optimization with OpenAI's Batch API](batch_job_oai.md): Reduce API costs by processing multiple requests efficiently.
+18. [Groq Cloud API Integration](groq.md): Leverage Groq's high-performance AI inference platform.
+19. [Mistral and Mixtral Model Usage](mistral.md): Implement state-of-the-art open-source language models in your projects.
+20. [Multi-Modal AI with Gemini](multi_modal_gemini.md): Process and analyze text, images, and other data types simultaneously.
+21. [IBM watsonx.ai Integration](watsonx.md): Utilize IBM's enterprise AI platform for advanced language processing tasks.
+22. [Receipt Information Extraction with GPT-4 Vision](extracting_receipts.md): Extract structured data from receipt images using advanced AI vision capabilities.
+23. [Slide Content Extraction with GPT-4 Vision](extract_slides.md): Convert presentation slide images into structured, analyzable text data.
+24. [Few-Shot Learning with Examples](examples.md): Improve AI model performance by providing contextual examples in prompts.
+25. [Local Classification without API](local_classification.md): Perform text classification tasks locally without relying on external API calls.
+
 
 ## Subscribe to our Newsletter for Updates and Tips
 

diff --git a/make_sitemap.py b/make_sitemap.py
@@ -0,0 +1,176 @@
+import os
+import asyncio
+import yaml
+from typing import Generator, Tuple, Dict, Optional, List
+from openai import AsyncOpenAI
+import typer
+from rich.console import Console
+from rich.progress import Progress
+import hashlib
+from asyncio import as_completed
+import tenacity
+
+console = Console()
+
+
+def traverse_docs(
+    root_dir: str = "docs",
+) -> Generator[Tuple[str, str, str], None, None]:
+    """
+    Recursively traverse the docs folder and yield the path, content, and content hash of each file.
+
+    Args:
+        root_dir (str): The root directory to start traversing from. Defaults to 'docs'.
+
+    Yields:
+        Tuple[str, str, str]: A tuple containing the relative path from 'docs', the file content, and the content hash.
+    """
+    for root, _, files in os.walk(root_dir):
+        for file in files:
+            if file.endswith(".md"):  # Assuming we're only interested in Markdown files
+                file_path = os.path.join(root, file)
+                relative_path = os.path.relpath(file_path, root_dir)
+
+                with open(file_path, "r", encoding="utf-8") as f:
+                    content = f.read()
+
+                content_hash = hashlib.md5(content.encode()).hexdigest()
+                yield relative_path, content, content_hash
+
+
+@tenacity.retry(
+    stop=tenacity.stop_after_attempt(3),
+    wait=tenacity.wait_exponential(multiplier=1, min=4, max=10),
+    retry=tenacity.retry_if_exception_type(Exception),
+    before_sleep=lambda retry_state: console.print(
+        f"[yellow]Retrying summarization... (Attempt {retry_state.attempt_number})[/yellow]"
+    ),
+)
+async def summarize_content(client: AsyncOpenAI, path: str, content: str) -> str:
+    """
+    Summarize the content of a file with retry logic.
+
+    Args:
+        client (AsyncOpenAI): The AsyncOpenAI client.
+        path (str): The path of the file.
+        content (str): The content of the file.
+
+    Returns:
+        str: A summary of the content.
+
+    Raises:
+        tenacity.RetryError: If all retry attempts fail (the decorator does not set reraise=True).
+    """
+    try:
+        response = await client.chat.completions.create(
+            model="gpt-4o",
+            messages=[
+                {
+                    "role": "system",
+                    "content": "You are a helpful assistant that summarizes text.",
+                },
+                {"role": "user", "content": content},
+                {
+                    "role": "user",
+                    "content": "Please summarize the content in a few sentences so they can be used for SEO. Include core ideas, objectives, and important details and key points and key words",
+                },
+            ],
+            max_tokens=4000,
+        )
+        return response.choices[0].message.content
+    except Exception as e:
+        console.print(f"[bold red]Error summarizing {path}: {str(e)}[/bold red]")
+        raise  # Re-raise the exception to trigger a retry
+
+
+async def generate_sitemap(
+    root_dir: str,
+    output_file: str,
+    api_key: Optional[str] = None,
+    max_concurrency: int = 5,
+) -> None:
+    """
+    Generate a sitemap from the given root directory.
+
+    Args:
+        root_dir (str): The root directory to start traversing from.
+        output_file (str): The output file to save the sitemap.
+        api_key (Optional[str]): The OpenAI API key. If not provided, it will be read from the OPENAI_API_KEY environment variable.
+        max_concurrency (int): The maximum number of concurrent tasks. Defaults to 5.
+    """
+    client = AsyncOpenAI(api_key=api_key)
+
+    # Load existing sitemap if it exists
+    existing_sitemap: Dict[str, Dict[str, str]] = {}
+    if os.path.exists(output_file):
+        with open(output_file, "r", encoding="utf-8") as sitemap_file:
+            existing_sitemap = yaml.safe_load(sitemap_file) or {}
+
+    sitemap_data: Dict[str, Dict[str, str]] = {}
+
+    async def process_file(
+        path: str, content: str, content_hash: str
+    ) -> Tuple[str, Dict[str, str]]:
+        if (
+            path in existing_sitemap
+            and existing_sitemap[path].get("hash") == content_hash
+        ):
+            return path, existing_sitemap[path]
+        try:
+            summary = await summarize_content(client, path, content)
+            return path, {"summary": summary, "hash": content_hash}
+        except Exception as e:
+            console.print(
+                f"[bold red]Failed to summarize {path} after multiple attempts: {str(e)}[/bold red]"
+            )
+            return path, {"summary": "Failed to generate summary", "hash": ""}  # empty hash so the next run retries this file
+
+    files_to_process: List[Tuple[str, str, str]] = list(traverse_docs(root_dir))
+    total_files = len(files_to_process)
+
+    with Progress() as progress:
+        task = progress.add_task("[green]Processing files...", total=total_files)
+
+        semaphore = asyncio.Semaphore(max_concurrency)
+
+        async def bounded_process_file(*args):
+            async with semaphore:
+                return await process_file(*args)
+
+        tasks = [
+            bounded_process_file(path, content, content_hash)
+            for path, content, content_hash in files_to_process
+        ]
+
+        for completed_task in as_completed(tasks):
+            path, result = await completed_task
+            sitemap_data[path] = result
+            progress.update(task, advance=1)
+
+            # Save intermediate results
+            with open(output_file, "w", encoding="utf-8") as sitemap_file:
+                yaml.dump(sitemap_data, sitemap_file, default_flow_style=False)
+
+    console.print(
+        f"[bold green]Sitemap has been generated and saved to {output_file}[/bold green]"
+    )
+
+
+app = typer.Typer()
+
+
+@app.command()
+def main(
+    root_dir: str = typer.Option("docs", help="Root directory to traverse"),
+    output_file: str = typer.Option("sitemap.yaml", help="Output file for the sitemap"),
+    api_key: Optional[str] = typer.Option(None, help="OpenAI API key"),
+    max_concurrency: int = typer.Option(5, help="Maximum number of concurrent tasks"),
+):
+    """
+    Generate a sitemap from the given root directory.
+    """
+    asyncio.run(generate_sitemap(root_dir, output_file, api_key, max_concurrency))
+
+
+if __name__ == "__main__":
+    app()
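
Note on `make_sitemap.py`: the script combines two ideas, reusing a stored summary when a file's MD5 hash is unchanged, and capping the number of in-flight summarization calls with an `asyncio.Semaphore`. The sketch below isolates that pattern so it can be run without an API key. It is a minimal illustration, not the code from this commit: `fake_summarize`, `build_sitemap`, and `demo` are hypothetical names, the OpenAI call is replaced by a stub, and the semaphore is held only around the summarization step rather than around the whole per-file task.

```python
import asyncio
import hashlib
from typing import Dict, List, Tuple


async def fake_summarize(content: str) -> str:
    """Hypothetical stand-in for the OpenAI call made by summarize_content."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return content[:20]


async def build_sitemap(
    docs: Dict[str, str],
    existing: Dict[str, Dict[str, str]],
    max_concurrency: int = 2,
) -> Tuple[Dict[str, Dict[str, str]], List[str]]:
    """Return the new sitemap and the sorted paths that were re-summarized."""
    semaphore = asyncio.Semaphore(max_concurrency)
    resummarized: List[str] = []

    async def process(path: str, content: str) -> Tuple[str, Dict[str, str]]:
        content_hash = hashlib.md5(content.encode()).hexdigest()
        cached = existing.get(path)
        if cached is not None and cached.get("hash") == content_hash:
            return path, cached  # unchanged file: reuse the stored entry
        async with semaphore:  # at most max_concurrency summaries in flight
            summary = await fake_summarize(content)
        resummarized.append(path)
        return path, {"summary": summary, "hash": content_hash}

    sitemap: Dict[str, Dict[str, str]] = {}
    pending = [process(path, content) for path, content in docs.items()]
    for finished in asyncio.as_completed(pending):
        path, entry = await finished
        sitemap[path] = entry
    return sitemap, sorted(resummarized)


def demo() -> Tuple[Dict[str, Dict[str, str]], List[str], List[str]]:
    """Build a sitemap twice, editing one file in between."""
    docs = {"a.md": "alpha", "b.md": "beta"}
    first, changed_first = asyncio.run(build_sitemap(docs, existing={}))
    docs["b.md"] = "beta, edited"
    second, changed_second = asyncio.run(build_sitemap(docs, existing=first))
    return second, changed_first, changed_second
```

On the first pass both files are summarized. On the second pass only `b.md` is, because the stored hash for `a.md` still matches its content.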