Commit

Merge pull request #191 from enoch3712/49-documentloaderconfig
49 documentloaderconfig
enoch3712 authored Jan 13, 2025
2 parents c89ca62 + e51887c commit 921defd
Showing 37 changed files with 3,375 additions and 599 deletions.
116 changes: 67 additions & 49 deletions docs/core-concepts/document-loaders/aws-textract.md
@@ -1,74 +1,92 @@
# AWS Textract Document Loader

> AWS Textract provides advanced OCR and document analysis capabilities, extracting text, forms, and tables from documents.
## Installation

Install the required dependencies:

```bash
pip install boto3
```

## Prerequisites

1. An AWS account
2. AWS credentials with access to Textract service
3. AWS region where Textract is available
The AWS Textract loader uses Amazon's Textract service to extract text, forms, and tables from documents. It supports both image files and PDFs.

## Supported Formats

- Images: jpeg/jpg, png, tiff
- Documents: pdf
- pdf
- jpeg
- png
- tiff

## Usage

### Basic Usage

```python
from extract_thinker import DocumentLoaderAWSTextract

# Initialize the loader with AWS credentials
# Initialize with AWS credentials
loader = DocumentLoaderAWSTextract(
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    region_name="your-region"
    aws_access_key_id="your_access_key",
    aws_secret_access_key="your_secret_key",
    region_name="your_region"
)

# Load document content
result = loader.load_content_from_file("document.pdf")
```
# Load document
pages = loader.load("path/to/your/document.pdf")

## Response Structure
# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
    # Access tables if extracted
    tables = page.get("tables", [])
```

The loader returns a dictionary with the following structure:
### Configuration-based Usage

```python
{
    "pages": [
        {
            "paragraphs": ["text content..."],
            "lines": ["line1", "line2"],
            "words": ["word1", "word2"]
        }
    ],
    "tables": [
        [["cell1", "cell2"], ["cell3", "cell4"]]
    ],
    "forms": [
        {"key": "value"}
    ],
    "layout": {
        # Document layout information
    }
}
from extract_thinker import DocumentLoaderAWSTextract, TextractConfig

# Create configuration
config = TextractConfig(
    aws_access_key_id="your_access_key",
    aws_secret_access_key="your_secret_key",
    region_name="your_region",
    feature_types=["TABLES", "FORMS", "SIGNATURES"],  # Specify features to extract
    cache_ttl=600,  # Cache results for 10 minutes
    max_retries=3  # Number of retry attempts
)

# Initialize loader with configuration
loader = DocumentLoaderAWSTextract(config)

# Load and process document
pages = loader.load("path/to/your/document.pdf")
```

## Supported Formats
## Configuration Options

The `TextractConfig` class supports the following options:

`PDF`, `JPEG`, `PNG`
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `content` | Any | None | Initial content to process |
| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
| `aws_access_key_id` | str | None | AWS access key ID |
| `aws_secret_access_key` | str | None | AWS secret access key |
| `region_name` | str | None | AWS region name |
| `textract_client` | boto3.client | None | Pre-configured Textract client |
| `feature_types` | List[str] | [] | Features to extract (TABLES, FORMS, LAYOUT, SIGNATURES) |
| `max_retries` | int | 3 | Maximum number of retry attempts |
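
As a rough illustration of what a `max_retries` setting typically means, the sketch below retries a failing call with exponential backoff. It is not the loader's actual implementation: `call_with_retries`, `base_delay`, and the injectable `sleep` parameter are hypothetical names, and it assumes `max_retries` counts retries after the first attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on any exception up to `max_retries` times.

    Hypothetical helper: assumes `max_retries` counts retries made after the
    initial attempt, so at most `max_retries + 1` calls happen in total.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            if attempt >= max_retries:
                raise  # Retries exhausted: surface the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.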

## Features

- Text extraction with layout preservation
- Text extraction from images and PDFs
- Table detection and extraction
- Support for multiple document formats
- Automatic retries on API failures
- Form field detection
- Layout analysis
- Signature detection
- Configurable feature selection
- Automatic retry on failure
- Caching support
- Support for pre-configured clients

## Notes

- Raw text extraction is the default when no feature types are specified
- "QUERIES" feature type is not supported
- Vision mode is supported for image formats
- AWS credentials are required unless using a pre-configured client
- Rate limits and quotas apply based on your AWS account
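
The feature-type rules in the notes above can be checked before any request is made. The following is a minimal sketch, not part of `extract_thinker`: `validate_feature_types` and `SUPPORTED_FEATURES` are hypothetical names, and the set of accepted values is taken from the configuration options table.

```python
from typing import Iterable, List

# Feature names listed in the configuration options table.
SUPPORTED_FEATURES = {"TABLES", "FORMS", "LAYOUT", "SIGNATURES"}


def validate_feature_types(feature_types: Iterable[str]) -> List[str]:
    """Return upper-cased feature types, rejecting unsupported ones.

    An empty result corresponds to the default raw text extraction.
    """
    normalized = [feature.upper() for feature in feature_types]
    unsupported = [f for f in normalized if f not in SUPPORTED_FEATURES]
    if unsupported:
        # "QUERIES" ends up here, since the notes mark it as unsupported.
        raise ValueError(f"Unsupported feature types: {unsupported}")
    return normalized
```

Calling `validate_feature_types(["QUERIES"])` raises `ValueError`, while an empty list passes through unchanged.
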
85 changes: 61 additions & 24 deletions docs/core-concepts/document-loaders/azure-form.md
@@ -1,34 +1,23 @@
# Azure Document Intelligence Document Loader
# Azure Document Intelligence Loader

The Azure Document Intelligence loader (formerly known as Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and layout information from documents.

## Installation

Install the required dependencies:

```bash
pip install azure-ai-formrecognizer
```

## Prerequisites

1. An Azure subscription
2. A Document Intelligence resource created in your Azure portal
3. The endpoint URL and subscription key from your Azure resource
The Azure Document Intelligence loader (formerly Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and structured information from documents.

## Supported Formats

Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.

## Usage

### Basic Usage

```python
from extract_thinker import DocumentLoaderAzureForm

# Initialize the loader
# Initialize with Azure credentials
loader = DocumentLoaderAzureForm(
    subscription_key="your-subscription-key",
    endpoint="your-endpoint-url"
    endpoint="your_endpoint",
    key="your_api_key",
    model="prebuilt-document"  # Use prebuilt document model
)

# Load document
@@ -38,14 +27,62 @@ pages = loader.load("path/to/your/document.pdf")
for page in pages:
    # Access text content
    text = page["content"]

    # Access tables (if any)
    tables = page["tables"]
    # Access tables if available
    tables = page.get("tables", [])
```
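
The per-page loop above can be wrapped in a small helper when the whole document is needed at once. This is a sketch under stated assumptions: `collect_text_and_tables` is a hypothetical function, and it assumes each page is a dictionary with a `"content"` string and an optional `"tables"` list, as in the usage example.

```python
from typing import Any, Dict, List, Tuple


def collect_text_and_tables(pages: List[Dict[str, Any]]) -> Tuple[str, List[Any]]:
    """Merge page text into one string and gather all tables in page order."""
    texts: List[str] = []
    tables: List[Any] = []
    for page in pages:
        texts.append(page.get("content", ""))
        # Pages without tables simply contribute nothing.
        tables.extend(page.get("tables", []))
    return "\n\n".join(texts), tables


# Example with hand-written pages standing in for loader output.
sample_pages = [
    {"content": "Invoice 001", "tables": [[["item", "price"], ["pen", "2"]]]},
    {"content": "Terms and conditions"},
]
full_text, all_tables = collect_text_and_tables(sample_pages)
```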

### Configuration-based Usage

```python
from extract_thinker import DocumentLoaderAzureForm, AzureConfig

# Create configuration
config = AzureConfig(
    endpoint="your_endpoint",
    key="your_api_key",
    model="prebuilt-read",  # Use the read model for text extraction
    language="en",  # Specify document language
    pages=[1, 2, 3],  # Process specific pages
    cache_ttl=600  # Cache results for 10 minutes
)

# Initialize loader with configuration
loader = DocumentLoaderAzureForm(config)

# Load and process document
pages = loader.load("path/to/your/document.pdf")
```

## Configuration Options

The `AzureConfig` class supports the following options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `content` | Any | None | Initial content to process |
| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
| `endpoint` | str | None | Azure endpoint URL |
| `key` | str | None | Azure API key |
| `model` | str | "prebuilt-document" | Model ID to use |
| `language` | str | None | Document language code |
| `pages` | List[int] | None | Specific pages to process |
| `reading_order` | str | "natural" | Text reading order |
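
To make the `cache_ttl` option concrete, here is a minimal time-to-live cache sketch. It only illustrates the general idea of expiring cached results after a number of seconds; `TTLCache` and `get_or_compute` are hypothetical names and do not describe how the loader caches internally.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Keep computed values for `ttl_seconds`, then recompute on next access."""

    def __init__(
        self,
        ttl_seconds: float = 300,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # Still fresh: reuse the cached value
        value = compute()  # Missing or expired: recompute and store
        self._entries[key] = (now, value)
        return value
```

With `ttl_seconds=600`, a second request for the same document within ten minutes would reuse the stored result instead of calling the service again.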

## Features

- Text extraction with layout preservation
- Table detection and extraction
- Support for multiple document formats
- Automatic table content deduplication from text
- Form field recognition
- Multiple model support (document, layout, read)
- Language specification
- Page selection
- Reading order control
- Caching support
- Support for pre-configured clients

## Notes

- Available models: "prebuilt-document", "prebuilt-layout", "prebuilt-read"
- Vision mode is supported for image formats
- Azure credentials are required
- Rate limits and quotas apply based on your Azure subscription
65 changes: 51 additions & 14 deletions docs/core-concepts/document-loaders/doc2txt.md
@@ -1,14 +1,6 @@
# Microsoft Word Document Loader (Doc2txt)
# Doc2txt Document Loader

The Doc2txt loader is designed to handle Microsoft Word documents (`.doc` and `.docx` files). It uses the `docx2txt` library to extract text content from Word documents.

## Installation

Install the required dependencies:

```bash
pip install docx2txt
```
The Doc2txt loader extracts text from Microsoft Word documents. It supports both legacy (.doc) and modern (.docx) file formats.

## Supported Formats

@@ -17,10 +9,12 @@ pip install docx2txt

## Usage

### Basic Usage

```python
from extract_thinker import DocumentLoaderDoc2txt

# Initialize the loader
# Initialize with default settings
loader = DocumentLoaderDoc2txt()

# Load document
@@ -32,9 +26,52 @@ for page in pages:
    text = page["content"]
```

### Configuration-based Usage

```python
from extract_thinker import DocumentLoaderDoc2txt, Doc2txtConfig

# Create configuration
config = Doc2txtConfig(
    page_separator="\n\n---\n\n",  # Custom page separator
    preserve_whitespace=True,  # Preserve original whitespace
    extract_images=True,  # Extract embedded images
    cache_ttl=600  # Cache results for 10 minutes
)

# Initialize loader with configuration
loader = DocumentLoaderDoc2txt(config)

# Load and process document
pages = loader.load("path/to/your/document.docx")
```

## Configuration Options

The `Doc2txtConfig` class supports the following options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `content` | Any | None | Initial content to process |
| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
| `page_separator` | str | "\n\n" | Text to use as page separator |
| `preserve_whitespace` | bool | False | Whether to preserve whitespace |
| `extract_images` | bool | False | Whether to extract embedded images |
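
One plausible reading of `page_separator` and `preserve_whitespace` is sketched below: page texts are joined with the separator, and whitespace is collapsed unless preservation is requested. This is an assumption for illustration only; `join_pages` is a hypothetical helper and the library's actual behavior may differ.

```python
import re
from typing import Any, Dict, List


def join_pages(
    pages: List[Dict[str, Any]],
    page_separator: str = "\n\n",
    preserve_whitespace: bool = False,
) -> str:
    """Join page contents with `page_separator` (assumed semantics)."""
    texts: List[str] = []
    for page in pages:
        text = page.get("content", "")
        if not preserve_whitespace:
            # Collapse runs of spaces/tabs and trim each line.
            text = "\n".join(
                re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()
            )
        texts.append(text)
    return page_separator.join(texts)


sample = [{"content": "Hello   world  "}, {"content": "Second\tpage"}]
joined = join_pages(sample, page_separator="\n\n---\n\n")
```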

## Features

- Text extraction from Word documents
- Support for both .doc and .docx formats
- Automatic page detection
- Preserves basic text formatting
- Support for both .doc and .docx
- Custom page separation
- Whitespace preservation
- Image extraction (optional)
- Caching support
- No cloud service required

## Notes

- Vision mode is not supported
- Image extraction requires additional memory
- Local processing with no external service dependencies
- May not preserve complex formatting
- Handles both legacy and modern Word formats