Merge branch 'main' into github-actions-63
edknv authored Sep 24, 2024
2 parents eed9459 + e73cdc2 commit 9ef3b75
Showing 9 changed files with 202 additions and 552 deletions.
36 changes: 36 additions & 0 deletions .github/workflows/docker-build.yml
@@ -0,0 +1,36 @@
name: Build NV-Ingest Runtime Image

# Trigger for pull requests and pushing to main
on:
  pull_request:
    types:
      - opened
      - synchronize
      - reopened

jobs:
  build:
    runs-on: linux-large-disk

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      # Set up Docker Buildx, useful for building multi-platform images
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      # Build the Docker image using the Dockerfile
      - name: Build Docker image
        run: |
          docker build -t nv-ingest:latest .
      - name: Run Pytest inside Docker container
        run: |
          docker run nv-ingest:latest pytest -rs --cov nv_ingest --cov nv_ingest_client --cov-report term --cov-report xml:coverage.xml tests/nv_ingest tests/nv_ingest_client
      - name: Upload test report
        uses: actions/upload-artifact@v4
        with:
          name: pytest-report
          path: report.xml
2 changes: 2 additions & 0 deletions Dockerfile
@@ -53,6 +53,8 @@ ENV NV_INGEST_CLIENT_VERSION_OVERRIDE=${NV_INGEST_VERSION_OVERRIDE}
RUN source activate morpheus \
&& pip install -r requirements.txt

COPY tests tests
COPY data data
COPY client client
COPY src/nv_ingest src/nv_ingest
RUN rm -rf ./src/nv_ingest/dist ./client/dist
126 changes: 72 additions & 54 deletions README.md
@@ -12,13 +12,15 @@ NVIDIA-Ingest is a scalable, performance-oriented document content and metadata
NVIDIA Ingest parallelizes the process of splitting documents into pages, where contents are classified (as tables, charts, images, text), extracted into discrete content, and further contextualized via optical character recognition (OCR) into a well-defined JSON schema. From there, NVIDIA Ingest can optionally manage computation of embeddings for the extracted content, and can also optionally manage storage in a [Milvus](https://milvus.io/) vector database.

### Table of Contents
1. [Introduction](#what-it-is)
1. [Introduction](#introduction)
2. [Prerequisites](#prerequisites)
3. [Quickstart](#quickstart)
4. [Repo Structure](#repo-structure)
5. [Notices](#notices)

### What it is
## Introduction

### What NVIDIA-Ingest is ✔️

A microservice that:

@@ -28,7 +30,7 @@ A microservice that:
- Supports multiple methods of extraction for each document type in order to balance trade-offs between throughput and accuracy. For example, for PDF documents we support extraction via pdfium, Unstructured.io, and Adobe Content Extraction Services.
- Supports various types of pre- and post-processing operations, including text splitting and chunking, transformation and filtering, embedding generation, and image offloading to storage.

### What it is not
### What NVIDIA-Ingest is not ✖️

A service that:

@@ -40,67 +42,74 @@ A service that:

### Hardware

| GPU | Family | Memory | # of GPUs |
| GPU | Family | Memory | # of GPUs (min.) |
| ------ | ------ | ------ | ------ |
| H100 | SXM/NVLink or PCIe | 80GB | 2 |
| A100 | SXM/NVLink or PCIe | 80GB | 2 |

### Software

- Linux operating systems (Ubuntu 20.04 or later recommended)
- Linux operating systems (Ubuntu 22.04 or later recommended)
- [Docker](https://docs.docker.com/engine/install/)
- [Docker Compose](https://docs.docker.com/compose/install/)
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) (NVIDIA Driver >= 535)
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) (NVIDIA Driver >= `535`, CUDA >= `12.2`)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)


## Quickstart

To get started using NVIDIA Ingest, you need to do a few things:
1. [Start supporting NIM microservices](#step-1-starting-containers)
2. [Install the NVIDIA Ingest client dependencies in a Python environment](#step-2-installing-python-dependencies)
3. [Submit ingestion job(s)](#step-3-ingesting-documents)
4. [Inspect and consume results](#step-4-inspecting-and-consuming-results)
1. [Start supporting NIM microservices](#step-1-starting-containers) 🏗️
2. [Install the NVIDIA Ingest client dependencies in a Python environment](#step-2-installing-python-dependencies) 🐍
3. [Submit ingestion job(s)](#step-3-ingesting-documents) 📓
4. [Inspect and consume results](#step-4-inspecting-and-consuming-results) 🔍

### Step 1: Starting containers

This example demonstrates how to use the provided [docker-compose.yaml](docker-compose.yaml) to start all needed services with a few commands.

> [!IMPORTANT]
> NIM containers on their first startup can take 10-15 minutes to pull and fully load models.
If preferred, you can also [start services one by one](docs/deployment.md), or run on Kubernetes via [our Helm chart](helm/README.md). Also of note are [additional environment variables](docs/environment-config.md) you may wish to configure.

1. First, git clone the repo:
`git clone https://github.com/nvidia/nv-ingest` and `cd nv-ingest`.
1. Git clone the repo:
`git clone https://github.com/nvidia/nv-ingest`
2. Change directory to the cloned repo
`cd nv-ingest`.

2. To access pre-built containers and NIM microservices, [generate API keys](docs/ngc-api-key.md) and authenticate with NGC with the `docker login` command:
3. [Generate API keys](docs/ngc-api-key.md) and authenticate with NGC with the `docker login` command:
```shell
# This is required to access pre-built containers and NIM microservices
$ docker login nvcr.io
Username: $oauthtoken
Password: <Your Key>
```

3. Container images must access resources from NGC. Create a .env file containing your NGC API key, and the following paths:
4. Create a .env file containing your NGC API key, and the following paths:
```
# Container images must access resources from NGC.
NGC_API_KEY=...
DATASET_ROOT=<PATH_TO_THIS_REPO>/data
NV_INGEST_ROOT=<PATH_TO_THIS_REPO>
```

Note: As configured by default in [docker-compose.yaml](docker-compose.yaml#L52), the DePlot NIM is on a dedicated GPU. All other NIMs and the nv-ingest container itself share a second. This is to avoid DePlot and other NIMs competing for VRAM on the same device. You can change the `CUDA_VISIBLE_DEVICES` pinnings as desired for your system within docker-compose.yaml.
> [!NOTE]
> As configured by default in [docker-compose.yaml](docker-compose.yaml#L52), the DePlot NIM is on a dedicated GPU. All other NIMs and the nv-ingest container itself share a second. This is to avoid DePlot and other NIMs competing for VRAM on the same device.
>
> Change the `CUDA_VISIBLE_DEVICES` pinnings as desired for your system within docker-compose.yaml.
4. To start all services:
5. Start all services:
`docker compose up`

Please note, NIM containers on their first startup can take 10-15 minutes to pull and fully load models. Also note that by default we have [configured log levels to be verbose](docker-compose.yaml#L27) so it's possible to observe service startup proceeding. You will notice _many_ log messages. You can turn off verbose logging by configuring `NIM_TRITON_LOG_VERBOSE=0` for each NIM in [docker-compose.yaml](docker-compose.yaml).

Also note that nv-ingest is in Early Access mode, meaning the codebase gets frequent updates. To build an updated nv-ingest service container with the latest changes you can:
```
docker compose build
```

After the image is built, run `docker compose up` as above.
> [!TIP]
> By default we have [configured log levels to be verbose](docker-compose.yaml#L27).
>
> It's possible to observe service startup proceeding: you will notice _many_ log messages. Disable verbose logging by configuring `NIM_TRITON_LOG_VERBOSE=0` for each NIM in [docker-compose.yaml](docker-compose.yaml).
5. When all services have fully started, `nvidia-smi` should show processes like the following:
6. When all services have fully started, `nvidia-smi` should show processes like the following:
```
# If it's taking > 1m for `nvidia-smi` to return, it's likely the bus is still busy setting up the models.
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
@@ -116,11 +125,9 @@ After the image is built, run `docker compose up` as above.
| 3 N/A N/A 1373202 C tritonserver 414MiB |
+---------------------------------------------------------------------------------------+
```
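The readiness rule of thumb that accompanies this output (a slow `nvidia-smi` return suggests the models are still loading, while a return within a few seconds suggests they are ready) can be sketched as a small timing helper. This is an illustrative sketch and not part of nv-ingest: the function names `time_command` and `looks_ready` and the 5-second threshold are assumptions made for this example.

```python
import subprocess
import time


def time_command(cmd, timeout_s=60.0):
    """Run cmd and return elapsed seconds, or None on timeout or missing binary."""
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
    return time.monotonic() - start


def looks_ready(elapsed, threshold_s=5.0):
    """Hypothetical heuristic: a fast return suggests model loading has finished."""
    return elapsed is not None and elapsed < threshold_s


# Example use on a GPU host (not executed here):
#   elapsed = time_command(["nvidia-smi"])
#   print("ready" if looks_ready(elapsed) else "still loading, or nvidia-smi unavailable")
```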
If it's taking > 1m for `nvidia-smi` to return, it's likely the bus is still busy setting up the models.

Once it completes normally (less than a few seconds), the NIM models are ready.

You can observe the started containers with `docker ps`:
Observe the started containers with `docker ps`:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0f2f86615ea5 nvcr.io/ohlfw0olaadg/ea-participants/nv-ingest:24.08 "/opt/conda/bin/tini…" 35 seconds ago Up 33 seconds 0.0.0.0:7670->7670/tcp, :::7670->7670/tcp nv-ingest-nv-ingest-ms-runtime-1
@@ -136,9 +143,17 @@ bda9a2a9c8b5 openzipkin/zipkin
ac27e5297d57 prom/prometheus:latest "/bin/prometheus --w…" 14 hours ago Up 33 seconds 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp nv-ingest-prometheus-1
```
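If an expected container is missing from this listing, one thing worth ruling out is an incomplete `.env` file. The sketch below checks for the three variables named in the setup steps (`NGC_API_KEY`, `DATASET_ROOT`, `NV_INGEST_ROOT`). It is a hypothetical helper, not part of nv-ingest, and it assumes plain `KEY=VALUE` lines without quoting or `export` prefixes.

```python
from pathlib import Path

# Variable names taken from the .env example in the setup steps.
REQUIRED_KEYS = ("NGC_API_KEY", "DATASET_ROOT", "NV_INGEST_ROOT")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or have an empty value."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


def check_env_file(path=".env"):
    """Read the file at path and report which required keys are missing."""
    return missing_keys(Path(path).read_text(encoding="utf-8"))
```

Calling `check_env_file()` from the repository root returns an empty list when all three variables are set. It does not verify that the key is valid or that the paths exist.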

> [!TIP]
> nv-ingest is in Early Access mode, meaning the codebase gets frequent updates. To build an updated nv-ingest service container with the latest changes you can:
> ```
> docker compose build
> ```
>
> After the image is built, run `docker compose up` per item 5 above.
### Step 2: Installing Python dependencies

You can interact with the nv-ingest service from the host, or by `docker exec`ing into the nv-ingest container.
You can interact with the nv-ingest service from the host, or by `docker exec`-ing into the nv-ingest container.

To interact from the host, you'll need a Python environment with the client dependencies installed:
```
@@ -150,33 +165,35 @@ pip install -r ./requirements.txt
pip install .
```
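Before submitting jobs from the host, it can help to confirm that the service port is reachable. The `docker ps` output in Step 1 shows the nv-ingest runtime container publishing port `7670`. The sketch below is a generic TCP check, with `port_is_open` as a hypothetical helper name and `localhost` as an assumed host.

```python
import socket


def port_is_open(host="localhost", port=7670, timeout_s=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A `True` result only shows that something is listening on the port. It does not confirm that the nv-ingest service has finished starting.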

Note that interacting from the host depends on the appropriate port being exposed from the nv-ingest container to the host as defined in [docker-compose.yaml](docker-compose.yaml#L141).

If you prefer, you can disable exposing that port, and interact with the nv-ingest service directly from within its container.

To interact within the container:
```
docker exec -it nv-ingest-nv-ingest-ms-runtime-1 bash
```
You'll be in the `/workspace` directory, which has `DATASET_ROOT` from the .env file mounted at `./data`. The pre-activated `morpheus` conda environment has all the python client libraries pre-installed:
```
(morpheus) root@aba77e2a4bde:/workspace#
```

From the bash prompt above, you can run nv-ingest-cli and Python examples described below.
> [!NOTE]
> Interacting from the host depends on the appropriate port being exposed from the nv-ingest container to the host as defined in [docker-compose.yaml](docker-compose.yaml#L141).
>
> If you prefer, you can disable exposing that port, and interact with the nv-ingest service directly from within its container.
>
> To interact within the container:
> ```
> docker exec -it nv-ingest-nv-ingest-ms-runtime-1 bash
> ```
> You'll be in the `/workspace` directory, which has `DATASET_ROOT` from the .env file mounted at `./data`. The pre-activated `morpheus` conda environment has all the python client libraries pre-installed:
> ```
> (morpheus) root@aba77e2a4bde:/workspace#
> ```
>
> From the bash prompt above, you can run nv-ingest-cli and Python examples described below.
### Step 3: Ingesting Documents

You can submit jobs programmatically in Python or via the nv-ingest-cli tool.

In the below examples, we are doing text, chart, table, and image extraction:
- `extract_text` - uses PDFium to find and extract text from pages
- `extract_images` - uses PDFium to extract images
- `extract_tables` - uses YOLOX to find tables and charts. Uses PaddleOCR for table extraction, and DePlot, CACHED, and PaddleOCR for chart extraction
- `extract_text` - uses [PDFium](https://github.com/pypdfium2-team/pypdfium2/) to find and extract text from pages
- `extract_images` - uses [PDFium](https://github.com/pypdfium2-team/pypdfium2/) to extract images
- `extract_tables` - uses [YOLOX](https://github.com/Megvii-BaseDetection/YOLOX) to find tables and charts. Uses [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) for table extraction, and [DePlot](https://huggingface.co/google/deplot), CACHED, and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) for chart extraction

Note that `extract_tables` controls extraction for both tables and charts.
> [!IMPORTANT]
> `extract_tables` controls extraction for both tables and charts.
In Python (you can find more documentation and examples [here](./client/client_examples/examples/python_client_usage.ipynb)):
#### In Python (you can find more documentation and examples [here](./client/client_examples/examples/python_client_usage.ipynb)):

```python
import logging, time
@@ -231,7 +248,7 @@ result = client.fetch_job_result(job_id, timeout=60)
print(f"Got {len(result)} results")
```

Using the `nv-ingest-cli` (you can find more nv-ingest-cli examples [here](./client/client_examples/examples/cli_client_usage.ipynb)):
#### Using the `nv-ingest-cli` (you can find more nv-ingest-cli examples [here](./client/client_examples/examples/cli_client_usage.ipynb)):

```shell
nv-ingest-cli \
@@ -275,7 +292,7 @@ INFO:nv_ingest_client.cli.util.processing:Throughput (Files/sec): 0.10

After the ingestion steps above have completed, you should be able to find `text` and `image` subfolders inside your processed docs folder. Each will contain JSON formatted extracted content and metadata.

When processing has completed, you'll have separate result files for text and image data:
#### When processing has completed, you'll have separate result files for text and image data:
```shell
ls -R processed_docs/
```
@@ -294,15 +311,16 @@ multimodal_test.pdf.metadata.json
```
You can view the full JSON extracts and the metadata definitions [here](docs/content-metadata.md).

We also provide a script for inspecting [extracted images](src/util/image_viewer.py)
#### We also provide a script for inspecting [extracted images](src/util/image_viewer.py)
```shell
pip install tkinter
python src/util/image_viewer.py --file_path ./processed_docs/image/multimodal_test.pdf.metadata.json
```
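Beyond the viewer script, a result file can also be inspected programmatically. The sketch below loads one metadata file and counts its entries by type. It assumes the file holds a JSON array of objects and that each object carries a `document_type` key. Both are assumptions made for illustration, so check them against [docs/content-metadata.md](docs/content-metadata.md). `load_entries` and `summarize` are hypothetical helper names.

```python
import json
from collections import Counter
from pathlib import Path


def load_entries(path):
    """Load a metadata file; wrap a single top-level object in a list."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data if isinstance(data, list) else [data]


def summarize(entries, key="document_type"):
    """Count entries by the value stored under key (assumed field name)."""
    return Counter(
        str(entry.get(key, "unknown")) if isinstance(entry, dict) else "unknown"
        for entry in entries
    )


# Example use (path taken from the viewer command above, not executed here):
#   entries = load_entries("./processed_docs/image/multimodal_test.pdf.metadata.json")
#   print(len(entries), dict(summarize(entries)))
```

Entries that lack the assumed key are counted under `unknown`, so an unexpected schema shows up in the summary instead of raising an error.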

Beyond inspecting the results, you can read them into things like [llama-index](examples/llama_index_multimodal_rag.ipynb) or [langchain](examples/langchain_multimodal_rag.ipynb) retrieval pipelines.

Please also check out our [demo using a retrieval pipeline on build.nvidia.com](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag) to query over document content pre-extracted with NVIDIA Ingest.
> [!TIP]
> Beyond inspecting the results, you can read them into things like [llama-index](examples/llama_index_multimodal_rag.ipynb) or [langchain](examples/langchain_multimodal_rag.ipynb) retrieval pipelines.
>
> Please also check out our [demo using a retrieval pipeline on build.nvidia.com](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag) to query over document content pre-extracted with NVIDIA Ingest.
## Repo Structure
