Commit
Merge pull request #173 from enoch3712/136-removing-heavy-dependencies-from-core-libraries

Remove Heavy dependencies
Showing 28 changed files with 690 additions and 1,143 deletions.
````diff
@@ -1,57 +1,51 @@
-# Azure Document Intelligence
+# Azure Document Intelligence Document Loader
 
-> Azure Document Intelligence (formerly known as `Azure Form Recognizer`) is a machine-learning based service that extracts texts (including handwriting), tables, document structures (e.g., titles, section headings, etc.) and key-value-pairs from digital or scanned PDFs, images, Office and HTML files.
+The Azure Document Intelligence loader (formerly known as Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and layout information from documents.
 
-## Prerequisite
+## Installation
 
-An Azure Document Intelligence resource in one of the 3 preview regions: `East US`, `West US2`, `West Europe`. You will be passing `<endpoint>` and `<key>` as parameters to the loader.
+Install the required dependencies:
 
-```python
-%pip install --upgrade --quiet extract_thinker azure-ai-formrecognizer
+```bash
+pip install azure-ai-formrecognizer
 ```
 
-## Basic Usage
+## Prerequisites
 
-Here's a simple example of using the Azure Document Intelligence Loader:
+1. An Azure subscription
+2. A Document Intelligence resource created in your Azure portal
+3. The endpoint URL and subscription key from your Azure resource
 
-```python
-from extract_thinker import Extractor
-from extract_thinker.document_loader import DocumentLoaderAzureForm
+## Supported Formats
 
-# Initialize the loader with Azure credentials
-subscription_key = os.getenv("AZURE_SUBSCRIPTION_KEY")
-endpoint = os.getenv("AZURE_ENDPOINT")
+Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
 
-loader = DocumentLoaderAzureForm(subscription_key, endpoint)
+## Usage
 
-# Load document
-content = loader.load("invoice.pdf")
+```python
+from extract_thinker import DocumentLoaderAzureForm
 
-# Get content list (page by page)
-content_list = loader.load_content_list("invoice.pdf")
-```
+# Initialize the loader
+loader = DocumentLoaderAzureForm(
+    subscription_key="your-subscription-key",
+    endpoint="your-endpoint-url"
+)
 
-## Advanced Configuration
+# Load document
+pages = loader.load("path/to/your/document.pdf")
 
-The loader provides advanced features for handling tables and document structure:
+# Process extracted content
+for page in pages:
+    # Access text content
+    text = page["content"]
 
-```python
-# The result will contain:
-# - Paragraphs (text content)
-# - Tables (structured data)
-# Each page is processed separately
-
-result = loader.load("document.pdf")
-for page in result["pages"]:
-    # Access paragraphs
-    for paragraph in page["paragraphs"]:
-        print(f"Text: {paragraph}")
-
-    # Access tables
-    for table in page["tables"]:
-        print(f"Table data: {table}")
+    # Access tables (if any)
+    tables = page["tables"]
 ```
 
-Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
-
-For more examples and implementation details, check out the [Azure Stack](../../../examples/azure-stack) in the repository.
+## Features
+
+- Text extraction with layout preservation
+- Table detection and extraction
+- Support for multiple document formats
+- Automatic table content deduplication from text
````
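The updated usage snippet above hard-codes placeholder credentials. For keeping the key and endpoint out of source control, here is a minimal sketch of the same flow with the values read from environment variables; the variable names `AZURE_SUBSCRIPTION_KEY` and `AZURE_ENDPOINT` are taken from the example removed in this diff, and everything else follows the updated loader API shown above:

```python
import os

from extract_thinker import DocumentLoaderAzureForm

# Credentials read from the environment rather than hard-coded
# (variable names follow the removed example; adjust to your setup).
subscription_key = os.getenv("AZURE_SUBSCRIPTION_KEY")
endpoint = os.getenv("AZURE_ENDPOINT")

loader = DocumentLoaderAzureForm(
    subscription_key=subscription_key,
    endpoint=endpoint
)

# Load a document and walk the per-page results, as in the updated docs
pages = loader.load("invoice.pdf")
for page in pages:
    print(page["content"])        # extracted text for the page
    for table in page["tables"]:  # structured table data, if any
        print(table)
```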
docs/core-concepts/document-loaders/google-document-ai.md (89 changes: 32 additions & 57 deletions)
````diff
@@ -1,82 +1,57 @@
 # Google Document AI Document Loader
 
-> Google Document AI transforms unstructured document data into structured, actionable insights using machine learning.
+The Google Document AI loader uses Google Cloud's Document AI service to extract text, tables, forms, and key-value pairs from documents.
 
-## Prerequisite
+## Installation
 
-You need Google Cloud credentials and a Document AI processor. You will need:
-- `DOCUMENTAI_GOOGLE_CREDENTIALS`
-- `DOCUMENTAI_LOCATION`
-- `DOCUMENTAI_PROCESSOR_NAME`
-
-```python
-%pip install --upgrade --quiet extract_thinker google-cloud-documentai
+```bash
+pip install google-cloud-documentai google-api-core google-oauth2-tool
 ```
 
 ## Basic Usage
 
 Here's a simple example of using the Google Document AI loader:
 
 ```python
-from extract_thinker import DocumentLoaderDocumentAI
+from extract_thinker import DocumentLoaderGoogleDocumentAI
 
 # Initialize the loader
-loader = DocumentLoaderDocumentAI(
-    credentials=os.getenv("DOCUMENTAI_GOOGLE_CREDENTIALS"),
-    location=os.getenv("DOCUMENTAI_LOCATION"),
-    processor_name=os.getenv("DOCUMENTAI_PROCESSOR_NAME")
+loader = DocumentLoaderGoogleDocumentAI(
+    project_id="your-project-id",
+    location="us", # or "eu"
+    processor_id="your-processor-id",
+    credentials="path/to/service-account.json" # or JSON string
 )
 
-# Load CV/Resume content
-content = loader.load_content_from_file("CV_Candidate.pdf")
-```
-
-## Response Structure
-
-The loader returns a dictionary containing:
-```python
-{
-    "pages": [
-        {
-            "content": "Full text content of the page",
-            "paragraphs": ["Paragraph 1", "Paragraph 2"],
-            "tables": [
-                [
-                    ["Header 1", "Header 2"],
-                    ["Value 1", "Value 2"]
-                ]
-            ]
-        }
-    ]
-}
-```
-
-## Processing Different Document Types
-
-```python
-# Process forms with tables
-content = loader.load_content_from_file("form_with_tables.pdf")
-
-# Process from stream
-with open("document.pdf", "rb") as f:
-    content = loader.load_content_from_stream(
-        stream=f,
-        mime_type="application/pdf"
-    )
+# Load document
+pages = loader.load("path/to/your/document.pdf")
+
+# Process extracted content
+for page in pages:
+    # Access text content
+    text = page["content"]
+
+    # Access tables (if any)
+    tables = page["tables"]
+
+    # Access form fields (if any)
+    forms = page["forms"]
+
+    # Access key-value pairs (if any)
+    key_values = page["key_value_pairs"]
 ```
 
-## Best Practices
-
-1. **Document Types**
-   - Use appropriate processor for document type
-   - Ensure correct MIME type for streams
-   - Validate content structure
-
-2. **Performance**
-   - Process in batches when possible
-   - Cache results for repeated access
-   - Monitor API quotas
-
-Document AI supports `PDF`, `TIFF`, `GIF`, `JPEG`, `PNG` with a maximum file size of 20MB or 2000 pages.
-
-For more examples and implementation details, check out the [Google Stack](../../../examples/google-stack) in the repository.
+## Features
+
+- Text extraction with layout preservation
+- Table detection and extraction
+- Form field detection
+- Key-value pair extraction
+- Support for multiple document formats
````
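The constructor comment in the updated usage block notes that `credentials` accepts either a path to a service-account file or the JSON content itself. Below is a minimal sketch of the JSON-string variant; the environment-variable name `DOCUMENTAI_GOOGLE_CREDENTIALS` is reused from the example removed in this diff, and the rest follows the updated loader API shown above:

```python
import os

from extract_thinker import DocumentLoaderGoogleDocumentAI

# Service-account JSON read from an environment variable instead of a file path
# (variable name reused from the removed example; adjust to your setup).
credentials_json = os.environ["DOCUMENTAI_GOOGLE_CREDENTIALS"]

loader = DocumentLoaderGoogleDocumentAI(
    project_id="your-project-id",
    location="us",  # or "eu"
    processor_id="your-processor-id",
    credentials=credentials_json
)

# Load a document and inspect the per-page results, as in the updated docs
pages = loader.load("form_with_tables.pdf")
for page in pages:
    print(page["content"])          # full page text
    print(page["tables"])           # detected tables, if any
    print(page["key_value_pairs"])  # extracted key-value pairs, if any
```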