Commit

Merge pull request #173 from enoch3712/136-removing-heavy-dependencies-from-core-libraries

Remove Heavy dependencies
enoch3712 authored Jan 2, 2025
2 parents db1cce1 + dd74625 commit 04712a0
Showing 28 changed files with 690 additions and 1,143 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/workflow.yml
@@ -26,7 +26,7 @@ jobs:
- name: Run critical tests
env:
GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
run: poetry run pytest tests/critical/ -v
run: poetry add pypdf && poetry run pytest tests/critical/ -v

- name: Build package
run: poetry build
55 changes: 28 additions & 27 deletions docs/core-concepts/document-loaders/aws-textract.md
@@ -2,29 +2,35 @@

> AWS Textract provides advanced OCR and document analysis capabilities, extracting text, forms, and tables from documents.
## Prerequisite
## Installation

You need AWS credentials with access to Textract service. You will need:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_DEFAULT_REGION`
Install the required dependencies:

```python
%pip install --upgrade --quiet extract_thinker boto3
```bash
pip install boto3
```

## Basic Usage
## Prerequisites

1. An AWS account
2. AWS credentials with access to Textract service
3. AWS region where Textract is available

## Supported Formats

Here's a simple example of using the AWS Textract loader:
- Images: jpeg/jpg, png, tiff
- Documents: pdf

## Usage

```python
from extract_thinker import DocumentLoaderTextract
from extract_thinker import DocumentLoaderAWSTextract

# Initialize the loader
loader = DocumentLoaderTextract(
aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
region_name=os.getenv('AWS_DEFAULT_REGION')
# Initialize the loader with AWS credentials
loader = DocumentLoaderAWSTextract(
aws_access_key_id="your-access-key",
aws_secret_access_key="your-secret-key",
region_name="your-region"
)

# Load document content
@@ -56,18 +62,13 @@ The loader returns a dictionary with the following structure:
}
```

## Best Practices

**Document Preparation**

- Use high-quality scans
- Support formats: `PDF`, `JPEG`, `PNG`
- Consider file size limits
## Supported Formats

**Performance**
`PDF`, `JPEG`, `PNG`

- Cache results when possible
- Process pages individually for large documents
- Monitor API quotas and costs
## Features

For more examples and implementation details, check out the [AWS Stack](../../../examples/aws-stack) in the repository.
- Text extraction with layout preservation
- Table detection and extraction
- Support for multiple document formats
- Automatic retries on API failures
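The response structure above is truncated in this diff, but based on the fields the features list names (text extraction and table detection), a small helper for flattening the returned tables can be sketched. The `response` dict here is a hypothetical stand-in mirroring that structure, not real Textract output:

```python
# Hypothetical Textract-style response, mirroring the documented structure:
# a "pages" list where each page carries "content" and "tables".
response = {
    "pages": [
        {
            "content": "Invoice #1001",
            "tables": [
                [["Item", "Price"], ["Widget", "9.99"]],
            ],
        }
    ]
}

def tables_to_csv_lines(response):
    """Flatten every table on every page into CSV-style strings."""
    lines = []
    for page in response["pages"]:
        for table in page.get("tables", []):
            for row in table:
                lines.append(",".join(row))
    return lines

print(tables_to_csv_lines(response))  # → ['Item,Price', 'Widget,9.99']
```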
72 changes: 33 additions & 39 deletions docs/core-concepts/document-loaders/azure-form.md
@@ -1,57 +1,51 @@
# Azure Document Intelligence
# Azure Document Intelligence Document Loader

> Azure Document Intelligence (formerly known as `Azure Form Recognizer`) is a machine-learning based service that extracts text (including handwriting), tables, document structures (e.g., titles and section headings), and key-value pairs from digital or scanned PDFs, images, Office and HTML files.
The Azure Document Intelligence loader (formerly known as Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and layout information from documents.

## Prerequisite
## Installation

An Azure Document Intelligence resource in one of the 3 preview regions: `East US`, `West US2`, `West Europe`. You will be passing `<endpoint>` and `<key>` as parameters to the loader.
Install the required dependencies:

```python
%pip install --upgrade --quiet extract_thinker azure-ai-formrecognizer
```bash
pip install azure-ai-formrecognizer
```

## Basic Usage

Here's a simple example of using the Azure Document Intelligence Loader:
## Prerequisites

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderAzureForm
1. An Azure subscription
2. A Document Intelligence resource created in your Azure portal
3. The endpoint URL and subscription key from your Azure resource

# Initialize the loader with Azure credentials
subscription_key = os.getenv("AZURE_SUBSCRIPTION_KEY")
endpoint = os.getenv("AZURE_ENDPOINT")
## Supported Formats

loader = DocumentLoaderAzureForm(subscription_key, endpoint)
Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.

# Load document
content = loader.load("invoice.pdf")
## Usage

# Get content list (page by page)
content_list = loader.load_content_list("invoice.pdf")
```
```python
from extract_thinker import DocumentLoaderAzureForm

## Advanced Configuration
# Initialize the loader
loader = DocumentLoaderAzureForm(
subscription_key="your-subscription-key",
endpoint="your-endpoint-url"
)

The loader provides advanced features for handling tables and document structure:
# Load document
pages = loader.load("path/to/your/document.pdf")

```python
# The result will contain:
# - Paragraphs (text content)
# - Tables (structured data)
# Each page is processed separately

result = loader.load("document.pdf")
for page in result["pages"]:
# Access paragraphs
for paragraph in page["paragraphs"]:
print(f"Text: {paragraph}")
# Process extracted content
for page in pages:
# Access text content
text = page["content"]

# Access tables
for table in page["tables"]:
print(f"Table data: {table}")
# Access tables (if any)
tables = page["tables"]
```

Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
## Features

For more examples and implementation details, check out the [Azure Stack](../../../examples/azure-stack) in the repository.
- Text extraction with layout preservation
- Table detection and extraction
- Support for multiple document formats
- Automatic table content deduplication from text
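Following the usage example above, where each page exposes `"content"` and `"tables"`, downstream processing reduces to plain list/dict handling. A minimal sketch, using hypothetical hard-coded pages in place of real loader output:

```python
# Hypothetical pages shaped like the loader's documented output:
# each page is a dict with "content" (text) and "tables" (list of tables).
pages = [
    {"content": "Invoice total: 120.00",
     "tables": [[["Qty", "Price"], ["2", "60.00"]]]},
    {"content": "Thank you for your business.", "tables": []},
]

# Join the text of all pages and gather every table in document order.
full_text = "\n".join(page["content"] for page in pages)
all_tables = [table for page in pages for table in page["tables"]]

print(len(all_tables))  # → 1
```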
45 changes: 28 additions & 17 deletions docs/core-concepts/document-loaders/doc2txt.md
@@ -2,28 +2,39 @@

The Doc2txt loader is designed to handle Microsoft Word documents (`.doc` and `.docx` files). It uses the `docx2txt` library to extract text content from Word documents.

## Basic Usage
## Installation

Install the required dependencies:

```bash
pip install docx2txt
```

## Supported Formats

- doc
- docx

## Usage

```python
from extract_thinker import Extractor, DocumentLoaderDoc2txt
from extract_thinker import DocumentLoaderDoc2txt

# Initialize the loader
loader = DocumentLoaderDoc2txt()

# Initialize the extractor with Doc2txt loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderDoc2txt())
# Load document
pages = loader.load("path/to/your/document.docx")

# Process a Word document
result = extractor.extract("document.docx", YourContract)
# Process extracted content
for page in pages:
# Access text content
text = page["content"]
```

## Features

- Supports both `.doc` and `.docx` file formats
- Automatically splits content into pages using double newlines as separators
- Preserves text formatting and structure
- Caches results for improved performance

## Limitations

- Does not support vision mode (images within Word documents are not processed)
- Does not preserve complex formatting or document styling
- Tables and other structured content may lose their layout
- Text extraction from Word documents
- Support for both .doc and .docx formats
- Automatic page detection
- Preserves basic text formatting
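The page-splitting rule described above (double newlines as page separators) can be illustrated in isolation. Here `raw_text` is a stand-in for what `docx2txt` would return for a real document:

```python
# Stand-in for docx2txt output; the loader splits such text on double
# newlines to produce its page list, as the docs describe.
raw_text = "First page text.\n\nSecond page text.\n\nThird page text."

# Split on double newlines and drop any empty chunks.
pages = [chunk for chunk in raw_text.split("\n\n") if chunk.strip()]

print(len(pages))  # → 3
print(pages[0])    # → First page text.
```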
89 changes: 32 additions & 57 deletions docs/core-concepts/document-loaders/google-document-ai.md
@@ -1,82 +1,57 @@
# Google Document AI Document Loader

> Google Document AI transforms unstructured document data into structured, actionable insights using machine learning.
The Google Document AI loader uses Google Cloud's Document AI service to extract text, tables, forms, and key-value pairs from documents.

## Prerequisite
## Installation

You need Google Cloud credentials and a Document AI processor. You will need:
- `DOCUMENTAI_GOOGLE_CREDENTIALS`
- `DOCUMENTAI_LOCATION`
- `DOCUMENTAI_PROCESSOR_NAME`

```python
%pip install --upgrade --quiet extract_thinker google-cloud-documentai
```bash
pip install google-cloud-documentai google-api-core google-oauth2-tool
```

## Basic Usage

Here's a simple example of using the Google Document AI loader:

```python
from extract_thinker import DocumentLoaderDocumentAI
from extract_thinker import DocumentLoaderGoogleDocumentAI

# Initialize the loader
loader = DocumentLoaderDocumentAI(
credentials=os.getenv("DOCUMENTAI_GOOGLE_CREDENTIALS"),
location=os.getenv("DOCUMENTAI_LOCATION"),
processor_name=os.getenv("DOCUMENTAI_PROCESSOR_NAME")
loader = DocumentLoaderGoogleDocumentAI(
project_id="your-project-id",
location="us", # or "eu"
processor_id="your-processor-id",
credentials="path/to/service-account.json" # or JSON string
)

# Load CV/Resume content
content = loader.load_content_from_file("CV_Candidate.pdf")
```

## Response Structure

The loader returns a dictionary containing:
```python
{
"pages": [
{
"content": "Full text content of the page",
"paragraphs": ["Paragraph 1", "Paragraph 2"],
"tables": [
[
["Header 1", "Header 2"],
["Value 1", "Value 2"]
]
]
}
]
}
```

## Processing Different Document Types

```python
# Process forms with tables
content = loader.load_content_from_file("form_with_tables.pdf")

# Process from stream
with open("document.pdf", "rb") as f:
content = loader.load_content_from_stream(
stream=f,
mime_type="application/pdf"
)
# Load document
pages = loader.load("path/to/your/document.pdf")

# Process extracted content
for page in pages:
# Access text content
text = page["content"]

# Access tables (if any)
tables = page["tables"]

# Access form fields (if any)
forms = page["forms"]

# Access key-value pairs (if any)
key_values = page["key_value_pairs"]
```

## Best Practices

1. **Document Types**
- Use appropriate processor for document type
- Ensure correct MIME type for streams
- Validate content structure

2. **Performance**
- Process in batches when possible
- Cache results for repeated access
- Monitor API quotas
## Features

Document AI supports `PDF`, `TIFF`, `GIF`, `JPEG`, `PNG` with a maximum file size of 20MB or 2000 pages.

For more examples and implementation details, check out the [Google Stack](../../../examples/google-stack) in the repository.
- Text extraction with layout preservation
- Table detection and extraction
- Form field detection
- Key-value pair extraction
- Support for multiple document formats
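The usage example above accesses `page["key_value_pairs"]`; collapsing those pairs into a dict is a common next step. This sketch assumes a list-of-tuples shape for the pairs, which is a guess for illustration, not the loader's confirmed return type:

```python
# Hypothetical page mirroring the fields accessed in the usage example;
# the (key, value) tuple shape of "key_value_pairs" is an assumption.
page = {
    "content": "Name: Ada Lovelace\nRole: Engineer",
    "key_value_pairs": [("Name", "Ada Lovelace"), ("Role", "Engineer")],
}

# Collapse the pairs into a plain dict for downstream lookups.
fields = {key: value for key, value in page["key_value_pairs"]}

print(fields["Name"])  # → Ada Lovelace
```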
13 changes: 1 addition & 12 deletions docs/core-concepts/document-loaders/markitdown.md
@@ -7,8 +7,7 @@
Here's how to use the MarkItDown loader:

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderMarkItDown
from extract_thinker import DocumentLoaderMarkItDown

# Initialize the loader
loader = DocumentLoaderMarkItDown()
@@ -28,16 +27,6 @@ text = pages_with_images[0]["content"]
image = pages_with_images[0]["image"] # bytes object
```

## Features

- Multi-format support (`PDF`, `DOC`, `DOCX`, `PPT`, `PPTX`, `XLS`, `XLSX`, etc.)
- Text extraction from various file types
- Optional vision mode for image extraction
- Page-by-page processing
- Stream-based loading support
- Caching capabilities
- LLM integration support

## Supported Formats

- Documents: `PDF`, `DOC`, `DOCX`, `PPT`, `PPTX`, `XLS`, `XLSX`