feat: prometheus (#30)
* feat: added prometheus #19

* feat: added pdfa/ua profiles for validate endpoint

* refactoring: unit/integration tests (use fast api test client for all API tests)

* refactoring: load test profile (use same pdf documents for all tests)

* chore: updated documentation
rueedlinger authored Jun 2, 2024
1 parent 185780f commit ef1ab1c
Showing 40 changed files with 1,448 additions and 717 deletions.
1 change: 0 additions & 1 deletion Dockerfile
@@ -29,7 +29,6 @@ RUN groupadd --gid $USER_GID $USERNAME &&\
apt-get --no-install-recommends install -y libreoffice && \
apt-get install -y default-jre-headless libreoffice-java-common jodconverter

# verapdf
COPY dist/auto-install.xml /tmp
RUN wget -O /tmp/verapdf-installer.zip https://software.verapdf.org/releases/verapdf-installer.zip && \
unzip -d /tmp /tmp/verapdf-installer.zip && \
8 changes: 8 additions & 0 deletions README.md
@@ -4,6 +4,14 @@
developer looking to automate PDF processing or integrate PDF functionalities into your existing workflow, Teal provides
a seamless and efficient solution.

## Key Features

- Digitize documents to searchable PDF or archivable PDF/A.
- Extract metadata, text, and tables as structured data.
- Convert different document types to PDF.
- Convert PDFs to PDF/A.
- Check PDF/A compliance.

## Getting Started

### Running Teal in App Mode
2 changes: 1 addition & 1 deletion dist/log_conf.yaml
@@ -20,7 +20,7 @@ handlers:

loggers:
teal:
level: DEBUG
level: INFO
handlers: [ console ]
propagate: no
uvicorn.error:
16 changes: 13 additions & 3 deletions dist/run.sh
@@ -20,11 +20,15 @@ fi


if [ -z ${TEAL_WORKERS+x} ]; then
TEAL_WORKERS=10
TEAL_WORKERS=1
echo "env TEAL_WORKERS is unset, will set to $TEAL_WORKERS"

else
echo "env TEAL_WORKERS is set to '$TEAL_WORKERS'"
if [ "$TEAL_WORKERS" -gt 1 ]; then
export PROMETHEUS_MULTIPROC_DIR="/tmp/prometheus"
echo "running in multi worker mode, creating PROMETHEUS_MULTIPROC_DIR $PROMETHEUS_MULTIPROC_DIR"
mkdir $PROMETHEUS_MULTIPROC_DIR
fi
fi

if [ -z ${TEAL_PORT+x} ]; then
@@ -51,8 +55,14 @@ fi

if [ "$TEAL_START_LOCUST" = true ] ; then
echo "env TEAL_START_LOCUST is set to '$TEAL_START_LOCUST'"

echo "starting in locust"
locust --host http://localhost:$TEAL_PORT --web-port 8089 -f tests/locustfile.py &
if [ -z ${TEAL_LOCUST_PORT+x} ]; then
locust --host http://localhost:$TEAL_PORT --web-port 8089 -f tests/locustfile.py &
else
locust --host http://localhost:$TEAL_PORT --web-port $TEAL_LOCUST_PORT -f tests/locustfile.py &
fi

fi

echo "see API doc http://$TEAL_IP_BIND:$TEAL_PORT/docs"
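
The multi-worker branch of this script sets `PROMETHEUS_MULTIPROC_DIR`, which the Prometheus Python client uses in multiprocess mode and which is expected to be empty when the workers start. The following is a minimal Python sketch of the same decision, assuming a hypothetical helper name `prepare_multiproc_dir`; unlike the script, it clears leftovers from an earlier run and tolerates an existing directory. It is an illustration, not code from this commit.

```python
import os
import shutil


def prepare_multiproc_dir(workers: int, path: str, env: dict) -> bool:
    """Prepare the Prometheus multiprocess directory for multi-worker runs.

    Hypothetical helper: returns True when the directory was prepared and
    False in single-worker mode, where no shared directory is needed.
    """
    if workers <= 1:
        # Single worker: leave the environment untouched.
        return False
    # Drop files left behind by a previous run, then recreate the directory.
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)
    env["PROMETHEUS_MULTIPROC_DIR"] = path
    return True
```

Called as `prepare_multiproc_dir(3, "/tmp/prometheus", os.environ)`, it would set the variable for the current process and any workers it spawns.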
2 changes: 1 addition & 1 deletion docker-compose.yaml
@@ -8,7 +8,7 @@ services:
- 8089:8089
environment:
TEAL_LOG_CONF: "log_conf.yaml"
# TEAL_WORKERS: 1
# TEAL_WORKERS: 3
# TEAL_WORKERS_TIMEOUT: 90
TEAL_PORT: 8000
# TEAL_IP_BIND: 0.0.0.0
95 changes: 67 additions & 28 deletions docs/developer_guide.md
@@ -2,7 +2,7 @@

## Python

### Install Python 3.12.0 using pyenv
### Install Python 3.12.0 Using pyenv

This command will download and install Python version 3.12.0. Pyenv is a popular tool for managing multiple versions of
Python on a single system.
@@ -11,7 +11,7 @@ Python on a single system.
pyenv install 3.12.0
```

### Create a virtual environment named 'teal'
### Create a Virtual Environment Named 'teal'

This command will create a virtual environment using the Python version 3.12.0 that was just installed. Virtual
environments are useful for managing project-specific dependencies.
@@ -20,7 +20,7 @@ environments are useful for managing project-specific dependencies.
pyenv virtualenv 3.12.0 teal
```

### Activate the virtual environment 'teal'
### Activate the Virtual Environment 'teal'

Activating the virtual environment ensures that any Python commands run will use the packages and interpreter from the
'teal' environment.
@@ -29,15 +29,30 @@ teal' environment.
pyenv activate teal
```

### Install dependencies from requirements.txt
### Install Dependencies from requirements.txt

This command will install all the necessary packages listed in the requirements.txt file. This file typically contains a
list of all the Python packages required for the project.

```bash
pip install -r requirements.txt
pip install -r requirements.in
```

### Update Dependencies

To update the dependencies, first modify the `requirements.in` file with the desired package versions or additions.
Then, run the following command to generate an updated requirements.txt file:

```bash
pip-compile --output-file=requirements.txt
```

This will ensure that the requirements.txt file is synchronized with the changes made in requirements.in.
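
One way to double-check that the two files are in sync is to compare the package names they contain. The sketch below is an assumption-laden illustration, not part of Teal: the function names are hypothetical, and it only checks that every package named in `requirements.in` has an `==` pin in `requirements.txt` (it does not compare versions).

```python
import re


def normalize(name: str) -> str:
    """Normalize a package name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def declared_packages(requirements_in: str) -> set[str]:
    """Collect package names from requirements.in text, skipping comments."""
    names = set()
    for line in requirements_in.splitlines():
        line = line.split("#", 1)[0].strip()
        # Keep only the name part before any extras or version specifier.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(normalize(match.group(0)))
    return names


def pinned_packages(requirements_txt: str) -> set[str]:
    """Collect names of packages pinned with '==' in requirements.txt text."""
    names = set()
    for line in requirements_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        match = re.match(r"([A-Za-z0-9][A-Za-z0-9._-]*)(\[[^\]]*\])?==", line)
        if match:
            names.add(normalize(match.group(1)))
    return names


def missing_pins(requirements_in: str, requirements_txt: str) -> set[str]:
    """Return declared packages that have no pin in the compiled file."""
    return declared_packages(requirements_in) - pinned_packages(requirements_txt)
```

Passing the contents of both files to `missing_pins` returns an empty set when every declared package is pinned; a non-empty result suggests `pip-compile` should be run again.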

> *Note:* *pip-compile* is a tool from the pip-tools package. If you don't have it installed, you can add it with
> `pip install pip-tools`. For more information, visit
> the [pip-tools documentation](https://pip-tools.readthedocs.io/en/stable/).
### Install Binaries

The easiest way is just to run the app inside the Docker container. This approach ensures that all necessary binaries
@@ -89,49 +104,73 @@ docker compose up --build

## Unit/Integration Testing

To run the pytest inside the docker container just pass the env `TEAL_TEST_MODE=true`. When you want to pass
arguments to pytest you can use the env `TEAL_PYTEST_ARGS`.
To run pytest inside the Docker container, set the environment variable `TEAL_TEST_MODE=true`. If you need to pass
arguments to pytest, you can use the `TEAL_PYTEST_ARGS` environment variable.

To run pytest without additional arguments, use the following command:

```bash
docker compose run --build --name teal_pytest --rm -e TEAL_TEST_MODE=true teal
docker compose run --build --name teal_pytest \
--rm -e TEAL_TEST_MODE=true teal
```

If you need to pass arguments to pytest, set the `TEAL_PYTEST_ARGS` environment variable. For example, to run tests in
verbose mode, you can use:

```bash
docker compose run --build --name teal_pytest \
--rm -e TEAL_TEST_MODE=true -e TEAL_PYTEST_ARGS="-v" teal
```
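
When the test run is driven from a script (for example in CI), the same command can be assembled programmatically. This is a small sketch with a hypothetical helper name, `pytest_compose_command`; it only builds the argument list shown above. Whether quoted values inside `TEAL_PYTEST_ARGS` survive depends on how the container's start script expands the variable, which this sketch does not verify.

```python
import shlex
from typing import Sequence


def pytest_compose_command(pytest_args: Sequence[str] = ()) -> list[str]:
    """Build the docker compose command that runs Teal in test mode."""
    command = [
        "docker", "compose", "run", "--build", "--name", "teal_pytest",
        "--rm", "-e", "TEAL_TEST_MODE=true",
    ]
    if pytest_args:
        # Join the pytest arguments into one shell-quoted string.
        command += ["-e", f"TEAL_PYTEST_ARGS={shlex.join(pytest_args)}"]
    command.append("teal")
    return command
```

The resulting list can be passed to `subprocess.run(pytest_compose_command(["-v"]), check=True)`.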

## Load Testing

You can run the load test locally or inside docker.
You can run the load test locally or inside a Docker container.

### Locally

The following command will start the load test with locust.
The following command will start the load test with Locust. Note that the application must be running on port 8000 when
you start the load test.

```bash
locust --host http://localhost:8000 --users 5 -t 10m --autostart -f tests/locustfile.py
locust --host http://localhost:8000 --users 5 -t 10m \
--autostart -f tests/locustfile.py
```

You can see the result with the locust webui (http://0.0.0.0:8089/).
You can view the results with the Locust web UI at http://0.0.0.0:8089/.

### Inside Docker

The following command will start the locust webui inside the docker container.
The following command will start the Locust web UI inside the Docker container:

```bash
docker compose run --build --rm -p 8089:8089 -p 8000:8000 -e TEAL_START_LOCUST=true teal
docker compose run --build --rm -p 8089:8089 -p 8000:8000 \
-e TEAL_START_LOCUST=true teal
```

You can now start the load test from the locust webui (http://0.0.0.0:8089/).
The `-e TEAL_START_LOCUST=true` option sets the environment variable that tells the container to start Locust.

### Result

The following is load test run with 5 users for 10 minutes (10 workers, worker timeout 120 seconds)
on a mac book pro (2023, Apple M2 Max, 64GB Mem) witch docker settings memory limit 16GB and CPU limit 12.
You can now start the load test from the Locust web UI, accessible at http://0.0.0.0:8089/. To begin, navigate to this
URL in your web browser. From the interface, you can configure various test parameters such as the number of users,
spawn rate, and duration of the test. Once your settings are in place, click the "Start" button to initiate the load
test. As the test runs, you can monitor real-time performance metrics and view detailed statistics on response times,
failure rates, and other key indicators. This will help you assess the performance and stability of your application
under load.

| Type | Name | # Requests | # Fails | Median (ms) | 95%ile (ms) | 99%ile (ms) | Average (ms) | Min (ms) | Max (ms) | Average size (bytes) | Current RPS | Current Failures/s |
|------|----------------------|------------|---------|-------------|-------------|-------------|--------------|----------|----------|----------------------|-------------|--------------------|
| POST | /libreoffice/convert | 370 | 0 | 620 | 750 | 940 | 628.05 | 514 | 1197 | 59527.49 | 0.5 | 0 |
| POST | /pdf/ocr | 326 | 0 | 6100 | 8100 | 9400 | 6198.9 | 4101 | 10376 | 5009 | 0.8 | 0 |
| POST | /pdf/table | 342 | 0 | 590 | 690 | 730 | 607.71 | 553 | 808 | 154 | 0.5 | 0 |
| POST | /pdf/text | 336 | 0 | 9 | 18 | 22 | 9.87 | 6 | 84 | 5169 | 1 | 0 |
| POST | /pdfa/convert | 335 | 0 | 360 | 440 | 480 | 367.51 | 316 | 812 | 51695 | 0.6 | 0 |
| POST | /pdfa/validate | 346 | 0 | 1100 | 1300 | 1400 | 1145.63 | 864 | 1543 | 214 | 0.4 | 0 |
| | Aggregated | 2055 | 0 | 600 | 6600 | 7900 | 1452.01 | 6 | 10376 | 20846.44 | 3.8 | 0 |
### Result

The test was performed on a MacBook Pro (2023 model, Apple M2 Max, 64GB
RAM). Docker settings were configured with a memory limit of 16GB and a CPU limit of 12 cores. Please note that the
results obtained from this test may vary based on differences in hardware and software configurations in your setup.

The following load test was conducted with 5 users for a duration of 10 minutes. The test configuration included 1
worker with a timeout of 120 seconds. The PDF document used for all tests has a size of 17 KB (16,873 bytes, one page).

| Type | Name | # Requests | # Fails | Median (ms) | 95%ile (ms) | 99%ile (ms) | Average (ms) | Min (ms) | Max (ms) | Average size (bytes) | Current RPS | Current Failures/s |
|----------------|----------------------|------------|---------|-------------|-------------|-------------|--------------|----------|-----------|----------------------|-------------|--------------------|
| POST | /libreoffice/convert | 191 | 0 | 4600 | 8800 | 10000 | 4766.63 | 500 | 12394 | 21297 | 0.2 | 0 |
| POST | /pdf/ocr | 209 | 0 | 1400 | 2600 | 2800 | 1509.29 | 684 | 3499 | 635 | 0.5 | 0 |
| POST | /pdf/table | 212 | 0 | 1300 | 2400 | 2600 | 1370.22 | 558 | 2955 | 2 | 0.1 | 0 |
| POST | /pdf/text | 223 | 0 | 780 | 2000 | 2300 | 783.64 | 2 | 2740 | 654 | 0.4 | 0 |
| POST | /pdfa/convert | 197 | 0 | 4600 | 9300 | 11000 | 4845.74 | 305 | 11176 | 21436 | 0.2 | 0 |
| POST | /pdfa/validate | 213 | 0 | 1500 | 2500 | 3000 | 1522.62 | 767 | 3499 | 171 | 0.5 | 0 |
| **Aggregated** | | **1245** | **0** | **1600** | **6900** | **9500** | **2385.57** | **2** | **12394** | **6912.47** | **1.9** | **0** |
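
As a sanity check, the aggregated row can be reproduced from the per-endpoint rows: the request count is a plain sum, and the aggregated average is the request-weighted mean of the per-endpoint averages. The sketch below copies the values from the table above; the names `RESULTS` and `aggregate` are illustrative.

```python
# (number of requests, average response time in ms) per endpoint,
# copied from the load test table.
RESULTS = {
    "/libreoffice/convert": (191, 4766.63),
    "/pdf/ocr": (209, 1509.29),
    "/pdf/table": (212, 1370.22),
    "/pdf/text": (223, 783.64),
    "/pdfa/convert": (197, 4845.74),
    "/pdfa/validate": (213, 1522.62),
}


def aggregate(results: dict) -> tuple:
    """Return (total requests, request-weighted mean response time in ms)."""
    total = sum(count for count, _ in results.values())
    weighted_mean = sum(count * avg for count, avg in results.values()) / total
    return total, weighted_mean
```

`aggregate(RESULTS)` yields 1245 requests and a weighted mean of roughly 2385.57 ms, which matches the aggregated row.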
35 changes: 19 additions & 16 deletions docs/getting_started.md
@@ -2,31 +2,21 @@

Teal has two modes:

- **app mode** will run the teal app. In app mode you can also start up the Locust webui.
- **test mode** will run the tests and print the result to stdout.
- **APP mode** will run the teal app. In app mode you can also start up the Locust webui.
- **TEST mode** will run the tests and print the result to stdout.

## Running Teal in App Mode

Here's a quick example of how easy it is to work with Teal:

```bash
docker run --pull=always --rm -it -p 8000:8000 --name teal ghcr.io/rueedlinger/teal:main
docker run --pull=always --rm -it -p 8000:8000 \
--name teal ghcr.io/rueedlinger/teal:main
```

Next, you can explore the API with the OpenAPI UI:

- http://localhost:8000/docs

### Starting Teal with Locust (Load Testing)

Teal also includes Locust load tests, you just need to set the environment variable `TEAL_START_LOCUST=true`.
The following command will start the Locust web UI inside the Docker container.

```bash
docker run --pull=always --rm -it -p 8089:8089 -p 8000:8000 -e TEAL_START_LOCUST=true --name teal ghcr.io/rueedlinger/teal:main
```

You can now start the load test from the locust webui (http://0.0.0.0:8089/).
- [http://localhost:8000/docs](http://localhost:8000/docs)
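
The API can also be called from code. The following is a minimal sketch using only the Python standard library, under stated assumptions: the helper names are hypothetical, the upload form field is assumed to be named `file`, and an endpoint path such as `/pdf/text` is taken from the load test results rather than from a verified API reference. The response is returned as raw bytes because its structure is not assumed here.

```python
import urllib.request
import uuid


def build_multipart(field: str, filename: str, content: bytes,
                    content_type: str = "application/pdf") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def post_pdf(base_url: str, path: str, pdf_bytes: bytes,
             filename: str = "document.pdf") -> bytes:
    """POST a PDF to a Teal endpoint and return the raw response body."""
    # Assumption: the endpoint expects the upload in a form field named "file".
    body, content_type = build_multipart("file", filename, pdf_bytes)
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the container from the command above running, `post_pdf("http://localhost:8000", "/pdf/text", pdf_bytes)` would send the document to the text extraction endpoint.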

### Teal REST API Endpoint

@@ -135,11 +125,24 @@ curl -X 'POST' \
-F '[email protected];type=application/vnd.openxmlformats-officedocument.wordprocessingml.document'
```

### Starting Teal with Locust (Load Testing)

Teal also includes Locust load tests. To enable them, set the environment variable `TEAL_START_LOCUST=true`.
The following command will start the Locust web UI inside the Docker container.

```bash
docker run --pull=always --rm -it -p 8089:8089 -p 8000:8000 \
-e TEAL_START_LOCUST=true --name teal ghcr.io/rueedlinger/teal:main
```

You can now start the load test from the Locust web UI at [http://localhost:8089/](http://localhost:8089/).

## Running Teal in Test Mode

Teal is packed with unit and integration tests. To run them, set the environment variable `TEAL_TEST_MODE=true`.
These tests can be run and verified with the following command.

```bash
docker run --pull=always --rm -it -p 8000:8000 -e TEAL_TEST_MODE=true --name teal ghcr.io/rueedlinger/teal:main
docker run --pull=always --rm -it -p 8000:8000 \
-e TEAL_TEST_MODE=true --name teal ghcr.io/rueedlinger/teal:main
```
78 changes: 60 additions & 18 deletions docs/index.md
@@ -4,6 +4,16 @@
developer looking to automate PDF processing or integrate PDF functionalities into your existing workflow, Teal provides
a seamless and efficient solution.

For the source code, see [https://github.com/rueedlinger/teal](https://github.com/rueedlinger/teal).

## Key Features

- Digitize documents to searchable PDF or archivable PDF/A.
- Extract metadata, text, and tables as structured data.
- Convert different document types to PDF.
- Convert PDFs to PDF/A.
- Check PDF/A compliance.

## Understanding Different Types of PDFs

**Digitally Created PDFs:**
@@ -24,25 +34,57 @@ a seamless and efficient solution.
- Have a text layer added underneath the image layer, making them fully searchable.
- Text can be selected, copied, and marked up like in original documents.

## Key Features
## Libraries and Binaries Used in Teal

- Digitize documents to searchable or archivable PDF (PDF/A).
- Extract metadata, text, and tables as structured data.
- Convert different document types to PDF.
- Convert PDFs to PDF/A.
- Check PDF/A compliance.
**Teal** uses other open-source libraries and provides this functionality through convenient APIs.

## Libraries Used in Teal
**Docker Base Image**

**Teal** uses other open-source libraries and provides this functionality through convenient APIs.
Currently, `python:3.12` is used as the Docker base image.

**Python Libraries:**

The following Python packages are defined in the `requirements.in` file.

```text
fastapi
prometheus-fastapi-instrumentator
python-multipart
uvicorn
gunicorn
pyyaml
pypdfium2
pytesseract
pdf2image
camelot-py
# needed by camelot-py
ghostscript
# needed by camelot-py
opencv-python
PyPDF2
pytest
pytest-cov
locust
black
```

You can generate the full list of dependencies with `pip-compile`
(see the [pip-tools documentation](https://pip-tools.readthedocs.io/en/stable/)).

**Binaries:**

The following binaries (Debian packages) are needed:

- tesseract-ocr
- tesseract-ocr-eng (and additional required languages)
- poppler-utils
- ocrmypdf
- ghostscript
- python3-tk
- libgl1
- libreoffice
- default-jre-headless
- libreoffice-java-common
- jodconverter

| Feature | Library |
|---------------------------------------------------|-------------------------|
| Extract text from PDFs | pypdfium2 |
| Extract text from scanned PDFs (OCR) | pytesseract |
| Extract tables from PDFs | camelot |
| Convert PDF to PDF/A (with OCR when no text) | ocrmypdf |
| Convert Office documents to PDF | libreoffice |
| PDF/A validation | veraPDF |
| Extract meta data from PDF | **not yet implemented** |
| Process documents from a remote repository (HTTP) | **not yet implemented** |
For more details, have a look at the Dockerfile.