Merge pull request #191 from enoch3712/49-documentloaderconfig
49 documentloaderconfig
Showing 37 changed files with 3,375 additions and 599 deletions.
@@ -1,74 +1,92 @@

# AWS Textract Document Loader

> AWS Textract provides advanced OCR and document analysis capabilities, extracting text, forms, and tables from documents.

## Installation

Install the required dependencies:

```bash
pip install boto3
```

## Prerequisites

1. An AWS account
2. AWS credentials with access to the Textract service
3. An AWS region where Textract is available

The AWS Textract loader uses Amazon's Textract service to extract text, forms, and tables from documents. It supports both image files and PDFs.

## Supported Formats

- Images: jpeg/jpg, png, tiff
- Documents: pdf

## Usage

### Basic Usage

```python
from extract_thinker import DocumentLoaderAWSTextract

# Initialize with AWS credentials
loader = DocumentLoaderAWSTextract(
    aws_access_key_id="your_access_key",
    aws_secret_access_key="your_secret_key",
    region_name="your_region"
)

# Load document
pages = loader.load("path/to/your/document.pdf")

# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
    # Access tables if extracted
    tables = page.get("tables", [])
```

### Configuration-based Usage

```python
from extract_thinker import DocumentLoaderAWSTextract, TextractConfig

# Create configuration
config = TextractConfig(
    aws_access_key_id="your_access_key",
    aws_secret_access_key="your_secret_key",
    region_name="your_region",
    feature_types=["TABLES", "FORMS", "SIGNATURES"],  # Specify features to extract
    cache_ttl=600,  # Cache results for 10 minutes
    max_retries=3  # Number of retry attempts
)

# Initialize loader with configuration
loader = DocumentLoaderAWSTextract(config)

# Load and process document
pages = loader.load("path/to/your/document.pdf")
```

## Response Structure

```python
# Load document content
result = loader.load_content_from_file("document.pdf")
```

The loader returns a dictionary with the following structure:

```python
{
    "pages": [
        {
            "paragraphs": ["text content..."],
            "lines": ["line1", "line2"],
            "words": ["word1", "word2"]
        }
    ],
    "tables": [
        [["cell1", "cell2"], ["cell3", "cell4"]]
    ],
    "forms": [
        {"key": "value"}
    ],
    "layout": {
        # Document layout information
    }
}
```

## Configuration Options

The `TextractConfig` class supports the following options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `content` | Any | None | Initial content to process |
| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
| `aws_access_key_id` | str | None | AWS access key ID |
| `aws_secret_access_key` | str | None | AWS secret access key |
| `region_name` | str | None | AWS region name |
| `textract_client` | boto3.client | None | Pre-configured Textract client |
| `feature_types` | List[str] | [] | Features to extract (TABLES, FORMS, LAYOUT, SIGNATURES) |
| `max_retries` | int | 3 | Maximum number of retry attempts |

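
To make the defaults in the option table concrete, the sketch below mirrors them in a plain dataclass. `TextractConfigSketch` is a hypothetical stand-in written only for illustration; it is not the library's `TextractConfig`, and only the option names and default values are taken from the table.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class TextractConfigSketch:
    """Hypothetical stand-in mirroring the option table; not the real TextractConfig."""

    content: Any = None                      # Initial content to process
    cache_ttl: int = 300                     # Cache time-to-live in seconds
    aws_access_key_id: Optional[str] = None
    aws_secret_access_key: Optional[str] = None
    region_name: Optional[str] = None
    textract_client: Any = None              # Pre-configured client, if any
    feature_types: List[str] = field(default_factory=list)
    max_retries: int = 3


# Defaults as listed in the table
defaults = TextractConfigSketch()

# Overriding only the options that differ from the defaults
custom = TextractConfigSketch(feature_types=["TABLES", "FORMS"], cache_ttl=600)
```

Using `field(default_factory=list)` gives each instance its own empty `feature_types` list, so changing one configuration does not leak into another.
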
## Features

- Text extraction with layout preservation
- Text extraction from images and PDFs
- Table detection and extraction
- Support for multiple document formats
- Automatic retries on API failures
- Form field detection
- Layout analysis
- Signature detection
- Configurable feature selection
- Caching support
- Support for pre-configured clients

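
The retry and caching features can be pictured with a small standard-library sketch. The names `call_with_retries` and `TTLCache`, the exponential backoff, and the reading of `max_retries` as "retries after the first attempt" are all assumptions made for illustration; the loader's internal implementation may differ.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


def call_with_retries(func: Callable[[], Any], max_retries: int = 3,
                      base_delay: float = 0.0) -> Any:
    """Call func; on failure retry up to max_retries more times (assumed semantics)."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # Retries exhausted: surface the last error
            time.sleep(base_delay * (2 ** attempt))  # Exponential backoff
            attempt += 1


class TTLCache:
    """Hypothetical time-to-live cache keyed by a string such as a file path."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._store[key]  # Entry expired: drop it
            return None
        return value


# A call that fails twice before succeeding
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = call_with_retries(flaky, max_retries=3)

# A cache driven by a fake clock so expiry can be shown without waiting
now = [0.0]
cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])
cache.set("document.pdf", ["page 1"])
fresh = cache.get("document.pdf")
now[0] = 301.0
expired = cache.get("document.pdf")
```

Injecting the clock keeps the expiry logic testable; in real use the default `time.monotonic` would apply.
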
## Notes

- Raw text extraction is the default when no feature types are specified
- The "QUERIES" feature type is not supported
- Vision mode is supported for image formats
- AWS credentials are required unless using a pre-configured client
- Rate limits and quotas apply based on your AWS account

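
The first, second, and fourth notes amount to a few checks that can be sketched without any AWS dependency. The function `resolve_mode`, the mode labels `"raw_text"` and `"analysis"`, and the exact credential check are hypothetical and chosen for illustration; they are not taken from the library.

```python
from typing import Any, Iterable, Optional

# Feature types listed in the configuration table; "QUERIES" is deliberately absent
SUPPORTED_FEATURES = frozenset({"TABLES", "FORMS", "LAYOUT", "SIGNATURES"})


def resolve_mode(feature_types: Iterable[str],
                 aws_access_key_id: Optional[str] = None,
                 aws_secret_access_key: Optional[str] = None,
                 textract_client: Any = None) -> str:
    """Validate settings and report which extraction mode would apply (illustrative)."""
    features = list(feature_types)
    unsupported = sorted(set(features) - SUPPORTED_FEATURES)
    if unsupported:
        raise ValueError(f"Unsupported feature types: {unsupported}")
    # Credentials are only needed when no pre-configured client is supplied
    if textract_client is None and not (aws_access_key_id and aws_secret_access_key):
        raise ValueError("Provide AWS credentials or a pre-configured client")
    # No feature types means plain text extraction
    return "analysis" if features else "raw_text"


mode_default = resolve_mode([], textract_client=object())
mode_tables = resolve_mode(["TABLES"], aws_access_key_id="key",
                           aws_secret_access_key="secret")

try:
    resolve_mode(["QUERIES"], textract_client=object())
    queries_rejected = False
except ValueError:
    queries_rejected = True

try:
    resolve_mode(["FORMS"])
    missing_credentials_rejected = False
except ValueError:
    missing_credentials_rejected = True
```
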