Merge pull request #191 from enoch3712/49-documentloaderconfig

49 documentloaderconfig
enoch3712 · Jan 13, 2025 · 921defd
2 parents c89ca62 + e51887c
commit 921defd
Showing 37 changed files with 3,375 additions and 599 deletions.
diff --git a/docs/core-concepts/document-loaders/aws-textract.md b/docs/core-concepts/document-loaders/aws-textract.md
@@ -1,74 +1,92 @@
 # AWS Textract Document Loader
 
-> AWS Textract provides advanced OCR and document analysis capabilities, extracting text, forms, and tables from documents.
-
-## Installation
-
-Install the required dependencies:
-
-```bash
-pip install boto3
-```
-
-## Prerequisites
-
-1. An AWS account
-2. AWS credentials with access to Textract service
-3. AWS region where Textract is available
+The AWS Textract loader uses Amazon's Textract service to extract text, forms, and tables from documents. It supports both image files and PDFs.
 
 ## Supported Formats
 
-- Images: jpeg/jpg, png, tiff
-- Documents: pdf
+- pdf
+- jpeg
+- png
+- tiff
 
 ## Usage
 
+### Basic Usage
+
 ```python
 from extract_thinker import DocumentLoaderAWSTextract
 
-# Initialize the loader with AWS credentials
+# Initialize with AWS credentials
 loader = DocumentLoaderAWSTextract(
-    aws_access_key_id="your-access-key",
-    aws_secret_access_key="your-secret-key",
-    region_name="your-region"
+    aws_access_key_id="your_access_key",
+    aws_secret_access_key="your_secret_key",
+    region_name="your_region"
 )
 
-# Load document content
-result = loader.load_content_from_file("document.pdf")
-```
+# Load document
+pages = loader.load("path/to/your/document.pdf")
 
-## Response Structure
+# Process extracted content
+for page in pages:
+    # Access text content
+    text = page["content"]
+    # Access tables if extracted
+    tables = page.get("tables", [])
+```
 
-The loader returns a dictionary with the following structure:
+### Configuration-based Usage
 
 ```python
-{
-    "pages": [
-        {
-            "paragraphs": ["text content..."],
-            "lines": ["line1", "line2"],
-            "words": ["word1", "word2"]
-        }
-    ],
-    "tables": [
-        [["cell1", "cell2"], ["cell3", "cell4"]]
-    ],
-    "forms": [
-        {"key": "value"}
-    ],
-    "layout": {
-        # Document layout information
-    }
-}
+from extract_thinker import DocumentLoaderAWSTextract, TextractConfig
+
+# Create configuration
+config = TextractConfig(
+    aws_access_key_id="your_access_key",
+    aws_secret_access_key="your_secret_key",
+    region_name="your_region",
+    feature_types=["TABLES", "FORMS", "SIGNATURES"],  # Specify features to extract
+    cache_ttl=600,                                    # Cache results for 10 minutes
+    max_retries=3                                     # Number of retry attempts
+)
+
+# Initialize loader with configuration
+loader = DocumentLoaderAWSTextract(config)
+
+# Load and process document
+pages = loader.load("path/to/your/document.pdf")
 ```
 
-## Supported Formats
+## Configuration Options
+
+The `TextractConfig` class supports the following options:
 
-`PDF`, `JPEG`, `PNG`
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `content` | Any | None | Initial content to process |
+| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
+| `aws_access_key_id` | str | None | AWS access key ID |
+| `aws_secret_access_key` | str | None | AWS secret access key |
+| `region_name` | str | None | AWS region name |
+| `textract_client` | boto3.client | None | Pre-configured Textract client |
+| `feature_types` | List[str] | [] | Features to extract (TABLES, FORMS, LAYOUT, SIGNATURES) |
+| `max_retries` | int | 3 | Maximum number of retry attempts |
 
 ## Features
 
-- Text extraction with layout preservation
+- Text extraction from images and PDFs
 - Table detection and extraction
-- Support for multiple document formats
-- Automatic retries on API failures 
+- Form field detection
+- Layout analysis
+- Signature detection
+- Configurable feature selection
+- Automatic retry on failure
+- Caching support
+- Support for pre-configured clients
+
+## Notes
+
+- Raw text extraction is the default when no feature types are specified
+- "QUERIES" feature type is not supported
+- Vision mode is supported for image formats
+- AWS credentials are required unless using a pre-configured client
+- Rate limits and quotas apply based on your AWS account 
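
The notes above say that an empty `feature_types` list means raw text extraction and that "QUERIES" is not supported. A minimal sketch of how such a check could look follows. The helper name `validate_feature_types` and the constant `SUPPORTED_FEATURES` are hypothetical and are not taken from extract_thinker; only the four feature names come from the configuration table.

```python
from typing import List, Optional

# Feature names listed in the TextractConfig options table.
SUPPORTED_FEATURES = {"TABLES", "FORMS", "LAYOUT", "SIGNATURES"}


def validate_feature_types(feature_types: Optional[List[str]]) -> List[str]:
    """Hypothetical helper: return a cleaned feature list or raise ValueError."""
    if not feature_types:
        # No features requested: fall back to raw text extraction.
        return []
    unsupported = [f for f in feature_types if f not in SUPPORTED_FEATURES]
    if unsupported:
        # "QUERIES" (or any unknown name) ends up here.
        raise ValueError(f"Unsupported feature types: {unsupported}")
    # Drop duplicates while keeping the caller's order.
    return list(dict.fromkeys(feature_types))
```

Failing early on an unsupported name avoids spending an API call on a request that the loader would reject anyway.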
diff --git a/docs/core-concepts/document-loaders/azure-form.md b/docs/core-concepts/document-loaders/azure-form.md
@@ -1,34 +1,23 @@
-# Azure Document Intelligence Document Loader
+# Azure Document Intelligence Loader
 
-The Azure Document Intelligence loader (formerly known as Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and layout information from documents.
-
-## Installation
-
-Install the required dependencies:
-
-```bash
-pip install azure-ai-formrecognizer
-```
-
-## Prerequisites
-
-1. An Azure subscription
-2. A Document Intelligence resource created in your Azure portal
-3. The endpoint URL and subscription key from your Azure resource
+The Azure Document Intelligence loader (formerly Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and structured information from documents.
 
 ## Supported Formats
 
 Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
 
 ## Usage
 
+### Basic Usage
+
 ```python
 from extract_thinker import DocumentLoaderAzureForm
 
-# Initialize the loader
+# Initialize with Azure credentials
 loader = DocumentLoaderAzureForm(
-    subscription_key="your-subscription-key",
-    endpoint="your-endpoint-url"
+    endpoint="your_endpoint",
+    key="your_api_key",
+    model="prebuilt-document"  # Use prebuilt document model
 )
 
 # Load document
@@ -38,14 +27,62 @@ pages = loader.load("path/to/your/document.pdf")
 for page in pages:
     # Access text content
     text = page["content"]
-
-    # Access tables (if any)
-    tables = page["tables"]
+    # Access tables if available
+    tables = page.get("tables", [])
 ```
 
+### Configuration-based Usage
+
+```python
+from extract_thinker import DocumentLoaderAzureForm, AzureConfig
+
+# Create configuration
+config = AzureConfig(
+    endpoint="your_endpoint",
+    key="your_api_key",
+    model="prebuilt-layout",     # Use layout model for enhanced layout analysis
+    language="en",               # Specify document language
+    pages=[1, 2, 3],            # Process specific pages
+    cache_ttl=600               # Cache results for 10 minutes
+)
+
+# Initialize loader with configuration
+loader = DocumentLoaderAzureForm(config)
+
+# Load and process document
+pages = loader.load("path/to/your/document.pdf")
+```
+
+## Configuration Options
+
+The `AzureConfig` class supports the following options:
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `content` | Any | None | Initial content to process |
+| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
+| `endpoint` | str | None | Azure endpoint URL |
+| `key` | str | None | Azure API key |
+| `model` | str | "prebuilt-document" | Model ID to use |
+| `language` | str | None | Document language code |
+| `pages` | List[int] | None | Specific pages to process |
+| `reading_order` | str | "natural" | Text reading order |
+
 ## Features
 
 - Text extraction with layout preservation
 - Table detection and extraction
-- Support for multiple document formats
-- Automatic table content deduplication from text
+- Form field recognition
+- Multiple model support (document, layout, read)
+- Language specification
+- Page selection
+- Reading order control
+- Caching support
+- Support for pre-configured clients
+
+## Notes
+
+- Available models: "prebuilt-document", "prebuilt-layout", "prebuilt-read"
+- Vision mode is supported for image formats
+- Azure credentials are required
+- Rate limits and quotas apply based on your Azure subscription
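
Both config tables above expose a `cache_ttl` option measured in seconds. The sketch below shows one way a time-to-live cache can behave, assuming entries expire once they are `ttl` seconds old. The `TTLCache` class and its injectable `clock` parameter are illustrative assumptions, not the library's actual caching code.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float = 300,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Remember when the value was stored alongside the value itself.
        self._store[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl:
            # Entry is too old: evict it and report a miss.
            del self._store[key]
            return None
        return value
```

With `ttl=600`, a document loaded at time 0 would be served from the cache until just before the 600-second mark and reloaded afterwards.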
diff --git a/docs/core-concepts/document-loaders/doc2txt.md b/docs/core-concepts/document-loaders/doc2txt.md
@@ -1,14 +1,6 @@
-# Microsoft Word Document Loader (Doc2txt)
+# Doc2txt Document Loader
 
-The Doc2txt loader is designed to handle Microsoft Word documents (`.doc` and `.docx` files). It uses the `docx2txt` library to extract text content from Word documents.
-
-## Installation
-
-Install the required dependencies:
-
-```bash
-pip install docx2txt
-```
+The Doc2txt loader extracts text from Microsoft Word documents. It supports both legacy (.doc) and modern (.docx) file formats.
 
 ## Supported Formats
 
@@ -17,10 +9,12 @@ pip install docx2txt
 
 ## Usage
 
+### Basic Usage
+
 ```python
 from extract_thinker import DocumentLoaderDoc2txt
 
-# Initialize the loader
+# Initialize with default settings
 loader = DocumentLoaderDoc2txt()
 
 # Load document
@@ -32,9 +26,52 @@ for page in pages:
     text = page["content"]
 ```
 
+### Configuration-based Usage
+
+```python
+from extract_thinker import DocumentLoaderDoc2txt, Doc2txtConfig
+
+# Create configuration
+config = Doc2txtConfig(
+    page_separator="\n\n---\n\n",  # Custom page separator
+    preserve_whitespace=True,      # Preserve original whitespace
+    extract_images=True,           # Extract embedded images
+    cache_ttl=600                  # Cache results for 10 minutes
+)
+
+# Initialize loader with configuration
+loader = DocumentLoaderDoc2txt(config)
+
+# Load and process document
+pages = loader.load("path/to/your/document.docx")
+```
+
+## Configuration Options
+
+The `Doc2txtConfig` class supports the following options:
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `content` | Any | None | Initial content to process |
+| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
+| `page_separator` | str | "\n\n" | Text to use as page separator |
+| `preserve_whitespace` | bool | False | Whether to preserve whitespace |
+| `extract_images` | bool | False | Whether to extract embedded images |
+
 ## Features
 
 - Text extraction from Word documents
-- Support for both .doc and .docx formats
-- Automatic page detection
-- Preserves basic text formatting
+- Support for both .doc and .docx
+- Custom page separation
+- Whitespace preservation
+- Image extraction (optional)
+- Caching support
+- No cloud service required
+
+## Notes
+
+- Vision mode is not supported
+- Image extraction requires additional memory
+- Processing is local; no external services are called
+- May not preserve complex formatting
+- Handles both legacy and modern Word formats
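
To make the `page_separator` and `preserve_whitespace` options more concrete, here is a small sketch that assumes the separator is used to split extracted text into pages shaped like the `page["content"]` dictionaries in the usage examples. The function `split_into_pages` is hypothetical; the real loader may apply these options differently.

```python
from typing import Dict, List


def split_into_pages(text: str,
                     page_separator: str = "\n\n",
                     preserve_whitespace: bool = False) -> List[Dict[str, str]]:
    """Hypothetical helper: split raw text into page dicts on a separator."""
    pages = []
    for chunk in text.split(page_separator):
        # Optionally trim leading and trailing whitespace from each page.
        content = chunk if preserve_whitespace else chunk.strip()
        if content.strip():
            # Skip chunks that contain nothing but whitespace.
            pages.append({"content": content})
    return pages
```

A distinctive separator such as `"\n\n---\n\n"` is less likely to collide with ordinary paragraph breaks than the default `"\n\n"`.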