Commit

Merge pull request #173 from enoch3712/136-removing-heavy-dependencies-from-core-libraries

Remove Heavy dependencies
enoch3712 authored Jan 2, 2025
2 parents db1cce1 + dd74625 commit 04712a0
Showing 28 changed files with 690 additions and 1,143 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/workflow.yml
@@ -26,7 +26,7 @@ jobs:
- name: Run critical tests
env:
GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
run: poetry run pytest tests/critical/ -v
run: poetry add pypdf && poetry run pytest tests/critical/ -v

- name: Build package
run: poetry build
55 changes: 28 additions & 27 deletions docs/core-concepts/document-loaders/aws-textract.md
@@ -2,29 +2,35 @@

> AWS Textract provides advanced OCR and document analysis capabilities, extracting text, forms, and tables from documents.
## Prerequisite
## Installation

You need AWS credentials with access to Textract service. You will need:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_DEFAULT_REGION`
Install the required dependencies:

```python
%pip install --upgrade --quiet extract_thinker boto3
```bash
pip install boto3
```

## Basic Usage
## Prerequisites

1. An AWS account
2. AWS credentials with access to Textract service
3. AWS region where Textract is available

## Supported Formats

Here's a simple example of using the AWS Textract loader:
- Images: jpeg/jpg, png, tiff
- Documents: pdf

## Usage

```python
from extract_thinker import DocumentLoaderTextract
from extract_thinker import DocumentLoaderAWSTextract

# Initialize the loader
loader = DocumentLoaderTextract(
aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
region_name=os.getenv('AWS_DEFAULT_REGION')
# Initialize the loader with AWS credentials
loader = DocumentLoaderAWSTextract(
aws_access_key_id="your-access-key",
aws_secret_access_key="your-secret-key",
region_name="your-region"
)

# Load document content
@@ -56,18 +62,13 @@ The loader returns a dictionary with the following structure:
}
```

## Best Practices

**Document Preparation**

- Use high-quality scans
- Support formats: `PDF`, `JPEG`, `PNG`
- Consider file size limits
## Supported Formats

**Performance**
`PDF`, `JPEG`, `PNG`

- Cache results when possible
- Process pages individually for large documents
- Monitor API quotas and costs
## Features

For more examples and implementation details, check out the [AWS Stack](../../../examples/aws-stack) in the repository.
- Text extraction with layout preservation
- Table detection and extraction
- Support for multiple document formats
- Automatic retries on API failures
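The response structure above is truncated in this diff, but based on the fields the features list names (text extraction and table detection), a small helper for flattening the returned tables can be sketched. The `response` dict here is a hypothetical stand-in mirroring that structure, not real Textract output:

```python
# Hypothetical Textract-style response, mirroring the documented structure:
# a "pages" list where each page carries "content" and "tables".
response = {
    "pages": [
        {
            "content": "Invoice #1001",
            "tables": [
                [["Item", "Price"], ["Widget", "9.99"]],
            ],
        }
    ]
}

def tables_to_csv_lines(response):
    """Flatten every table on every page into CSV-style strings."""
    lines = []
    for page in response["pages"]:
        for table in page.get("tables", []):
            for row in table:
                lines.append(",".join(row))
    return lines

print(tables_to_csv_lines(response))  # → ['Item,Price', 'Widget,9.99']
```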
72 changes: 33 additions & 39 deletions docs/core-concepts/document-loaders/azure-form.md
@@ -1,57 +1,51 @@
# Azure Document Intelligence
# Azure Document Intelligence Document Loader

> Azure Document Intelligence (formerly known as `Azure Form Recognizer`) is a machine-learning based service that extracts text (including handwriting), tables, document structures (e.g., titles and section headings), and key-value pairs from digital or scanned PDFs, images, Office and HTML files.
The Azure Document Intelligence loader (formerly known as Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and layout information from documents.

## Prerequisite
## Installation

An Azure Document Intelligence resource in one of the 3 preview regions: `East US`, `West US2`, `West Europe`. You will be passing `<endpoint>` and `<key>` as parameters to the loader.
Install the required dependencies:

```python
%pip install --upgrade --quiet extract_thinker azure-ai-formrecognizer
```bash
pip install azure-ai-formrecognizer
```

## Basic Usage

Here's a simple example of using the Azure Document Intelligence Loader:
## Prerequisites

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderAzureForm
1. An Azure subscription
2. A Document Intelligence resource created in your Azure portal
3. The endpoint URL and subscription key from your Azure resource

# Initialize the loader with Azure credentials
subscription_key = os.getenv("AZURE_SUBSCRIPTION_KEY")
endpoint = os.getenv("AZURE_ENDPOINT")
## Supported Formats

loader = DocumentLoaderAzureForm(subscription_key, endpoint)
Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.

# Load document
content = loader.load("invoice.pdf")
## Usage

# Get content list (page by page)
content_list = loader.load_content_list("invoice.pdf")
```
```python
from extract_thinker import DocumentLoaderAzureForm

## Advanced Configuration
# Initialize the loader
loader = DocumentLoaderAzureForm(
subscription_key="your-subscription-key",
endpoint="your-endpoint-url"
)

The loader provides advanced features for handling tables and document structure:
# Load document
pages = loader.load("path/to/your/document.pdf")

```python
# The result will contain:
# - Paragraphs (text content)
# - Tables (structured data)
# Each page is processed separately

result = loader.load("document.pdf")
for page in result["pages"]:
# Access paragraphs
for paragraph in page["paragraphs"]:
print(f"Text: {paragraph}")
# Process extracted content
for page in pages:
# Access text content
text = page["content"]

# Access tables
for table in page["tables"]:
print(f"Table data: {table}")
# Access tables (if any)
tables = page["tables"]
```

Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
## Features

For more examples and implementation details, check out the [Azure Stack](../../../examples/azure-stack) in the repository.
- Text extraction with layout preservation
- Table detection and extraction
- Support for multiple document formats
- Automatic table content deduplication from text
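Following the usage example above, where each page exposes `"content"` and `"tables"`, downstream processing reduces to plain list/dict handling. A minimal sketch, using hypothetical hard-coded pages in place of real loader output:

```python
# Hypothetical pages shaped like the loader's documented output:
# each page is a dict with "content" (text) and "tables" (list of tables).
pages = [
    {"content": "Invoice total: 120.00",
     "tables": [[["Qty", "Price"], ["2", "60.00"]]]},
    {"content": "Thank you for your business.", "tables": []},
]

# Join the text of all pages and gather every table in document order.
full_text = "\n".join(page["content"] for page in pages)
all_tables = [table for page in pages for table in page["tables"]]

print(len(all_tables))  # → 1
```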
45 changes: 28 additions & 17 deletions docs/core-concepts/document-loaders/doc2txt.md
@@ -2,28 +2,39 @@

The Doc2txt loader is designed to handle Microsoft Word documents (`.doc` and `.docx` files). It uses the `docx2txt` library to extract text content from Word documents.

## Basic Usage
## Installation

Install the required dependencies:

```bash
pip install docx2txt
```

## Supported Formats

- doc
- docx

## Usage

```python
from extract_thinker import Extractor, DocumentLoaderDoc2txt
from extract_thinker import DocumentLoaderDoc2txt

# Initialize the loader
loader = DocumentLoaderDoc2txt()

# Initialize the extractor with Doc2txt loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderDoc2txt())
# Load document
pages = loader.load("path/to/your/document.docx")

# Process a Word document
result = extractor.extract("document.docx", YourContract)
# Process extracted content
for page in pages:
# Access text content
text = page["content"]
```

## Features

- Supports both `.doc` and `.docx` file formats
- Automatically splits content into pages using double newlines as separators
- Preserves text formatting and structure
- Caches results for improved performance

## Limitations

- Does not support vision mode (images within Word documents are not processed)
- Does not preserve complex formatting or document styling
- Tables and other structured content may lose their layout
- Text extraction from Word documents
- Support for both .doc and .docx formats
- Automatic page detection
- Preserves basic text formatting
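The page-splitting rule described above (double newlines as page separators) can be illustrated in isolation. Here `raw_text` is a stand-in for what `docx2txt` would return for a real document:

```python
# Stand-in for docx2txt output; the loader splits such text on double
# newlines to produce its page list, as the docs describe.
raw_text = "First page text.\n\nSecond page text.\n\nThird page text."

# Split on double newlines and drop any empty chunks.
pages = [chunk for chunk in raw_text.split("\n\n") if chunk.strip()]

print(len(pages))  # → 3
print(pages[0])    # → First page text.
```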
89 changes: 32 additions & 57 deletions docs/core-concepts/document-loaders/google-document-ai.md
@@ -1,82 +1,57 @@
# Google Document AI Document Loader

> Google Document AI transforms unstructured document data into structured, actionable insights using machine learning.
The Google Document AI loader uses Google Cloud's Document AI service to extract text, tables, forms, and key-value pairs from documents.

## Prerequisite
## Installation

You need Google Cloud credentials and a Document AI processor. You will need:
- `DOCUMENTAI_GOOGLE_CREDENTIALS`
- `DOCUMENTAI_LOCATION`
- `DOCUMENTAI_PROCESSOR_NAME`

```python
%pip install --upgrade --quiet extract_thinker google-cloud-documentai
```bash
pip install google-cloud-documentai google-api-core google-oauth2-tool
```

## Basic Usage

Here's a simple example of using the Google Document AI loader:

```python
from extract_thinker import DocumentLoaderDocumentAI
from extract_thinker import DocumentLoaderGoogleDocumentAI

# Initialize the loader
loader = DocumentLoaderDocumentAI(
credentials=os.getenv("DOCUMENTAI_GOOGLE_CREDENTIALS"),
location=os.getenv("DOCUMENTAI_LOCATION"),
processor_name=os.getenv("DOCUMENTAI_PROCESSOR_NAME")
loader = DocumentLoaderGoogleDocumentAI(
project_id="your-project-id",
location="us", # or "eu"
processor_id="your-processor-id",
credentials="path/to/service-account.json" # or JSON string
)

# Load CV/Resume content
content = loader.load_content_from_file("CV_Candidate.pdf")
```

## Response Structure

The loader returns a dictionary containing:
```python
{
"pages": [
{
"content": "Full text content of the page",
"paragraphs": ["Paragraph 1", "Paragraph 2"],
"tables": [
[
["Header 1", "Header 2"],
["Value 1", "Value 2"]
]
]
}
]
}
```

## Processing Different Document Types

```python
# Process forms with tables
content = loader.load_content_from_file("form_with_tables.pdf")

# Process from stream
with open("document.pdf", "rb") as f:
content = loader.load_content_from_stream(
stream=f,
mime_type="application/pdf"
)
# Load document
pages = loader.load("path/to/your/document.pdf")

# Process extracted content
for page in pages:
# Access text content
text = page["content"]

# Access tables (if any)
tables = page["tables"]

# Access form fields (if any)
forms = page["forms"]

# Access key-value pairs (if any)
key_values = page["key_value_pairs"]
```

## Best Practices

1. **Document Types**
- Use appropriate processor for document type
- Ensure correct MIME type for streams
- Validate content structure

2. **Performance**
- Process in batches when possible
- Cache results for repeated access
- Monitor API quotas
## Features

Document AI supports `PDF`, `TIFF`, `GIF`, `JPEG`, `PNG` with a maximum file size of 20MB or 2000 pages.

For more examples and implementation details, check out the [Google Stack](../../../examples/google-stack) in the repository.
- Text extraction with layout preservation
- Table detection and extraction
- Form field detection
- Key-value pair extraction
- Support for multiple document formats
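The usage example above accesses `page["key_value_pairs"]`; collapsing those pairs into a dict is a common next step. This sketch assumes a list-of-tuples shape for the pairs, which is a guess for illustration, not the loader's confirmed return type:

```python
# Hypothetical page mirroring the fields accessed in the usage example;
# the (key, value) tuple shape of "key_value_pairs" is an assumption.
page = {
    "content": "Name: Ada Lovelace\nRole: Engineer",
    "key_value_pairs": [("Name", "Ada Lovelace"), ("Role", "Engineer")],
}

# Collapse the pairs into a plain dict for downstream lookups.
fields = {key: value for key, value in page["key_value_pairs"]}

print(fields["Name"])  # → Ada Lovelace
```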
13 changes: 1 addition & 12 deletions docs/core-concepts/document-loaders/markitdown.md
@@ -7,8 +7,7 @@
Here's how to use the MarkItDown loader:

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderMarkItDown
from extract_thinker import DocumentLoaderMarkItDown

# Initialize the loader
loader = DocumentLoaderMarkItDown()
@@ -28,16 +27,6 @@ text = pages_with_images[0]["content"]
image = pages_with_images[0]["image"] # bytes object
```

## Features

- Multi-format support (`PDF`, `DOC`, `DOCX`, `PPT`, `PPTX`, `XLS`, `XLSX`, etc.)
- Text extraction from various file types
- Optional vision mode for image extraction
- Page-by-page processing
- Stream-based loading support
- Caching capabilities
- LLM integration support

## Supported Formats

- Documents: `PDF`, `DOC`, `DOCX`, `PPT`, `PPTX`, `XLS`, `XLSX`