Initial stable release - v. 0.1.0 (#2)
* initial commit [WIP]

* added original tika parser from cogstack pipeline as a baseline

* further works on a new parser, cleanup

* added Dockerfile to build tika server image

* working on config

* working on config

* config of legacy parser + minor refactoring

* minor refactoring + debugging on the PDF/OCR

* added use of legacy parser for single-page documents; updated dockerfile; minor refactoring

* added /info endpoint to display configuration

* minor refactor

* adding tests

* fixed a bug of not recognising properly X-OCR-Applied flag

* added more tests and test files; minor refactor

* added tests for composite processor; minor tests reorg

* adding tests for controller

* added more tests for controller (stream, multipart)

* added support for control over failing on empty / incorrect document types; moved all properties to one application.yaml (limitations of spring)

* Update README.md

* build: adding support for travis CI building

* buiid: fixing travis dependencies

* build: print more verbose info about failed tests

* build: print more verbose info about failed tests

* build: hunting for failed tests error cause

* build: hunting for failed tests error cause

* build: hunting for failed tests error cause

* build: hunting for failed tests error cause

* build: hunting for failed tests error cause

* build: hunting for failed tests error cause

* build: hunting for failed tests error cause

* build: hunting for failed tests error cause

* build: hunting for failed tests error cause

* build: hunting for failed tests error cause

* fixing TravisCI failed tests (ImageMagick policy)

* fixing TravisCI

* fixing travis

* another attempt to fix the travis build

* debugging failing travis

* debugging travis build -- imagemagick

* travis build script cleanup

* Update travis_gradle_build.sh

* added run.sh to run the service in the (updated) Dockerfile

* code cleanup + code documenting

* added test on handling empty files

* Update README.md

* Update README.md

* added inclusion of document processed timestamp

* minor renaming - for consistency

* fixing datetime json de/serialization

* added changelog

* Update README.md

* version bump in yaml config files

* Update README.md

* proper version bump --> 0.1.0
lrog authored Aug 15, 2019
1 parent 97601e1 commit 56727da
Showing 59 changed files with 3,304 additions and 25 deletions.
27 changes: 5 additions & 22 deletions .gitignore
@@ -1,23 +1,6 @@
# Compiled class file
*.class
.DS_Store
.idea
.gradle

# Log file
*.log

# BlueJ files
*.ctxt

# Mobile Tools for Java (J2ME)
.mtj.tmp/

# Package Files #
*.jar
*.war
*.nar
*.ear
*.zip
*.tar.gz
*.rar

# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
hs_err_pid*
build
out
49 changes: 49 additions & 0 deletions .travis.yml
@@ -0,0 +1,49 @@
dist: xenial

language: java

jdk:
- openjdk11

env:
# limit the number of processing threads used by tesseract
- OMP_THREAD_LIMIT=1

addons:
  apt:
    sources:
      # tesseract-ocr >= 4.0 is not available in the standard Xenial / Trusty distro
      - sourceline: 'ppa:alex-p/tesseract-ocr'
    packages:
      - tesseract-ocr
      - tesseract-ocr-osd
      - tesseract-ocr-eng
      - imagemagick
      - ghostscript
      - libtesseract-dev
      - libmagickcore-dev
      - libmagickwand-dev
      - libmagic-dev
      - apache2-utils

before_cache:
- rm -f $HOME/.gradle/caches/modules-2/modules-2.lock
- rm -fr $HOME/.gradle/caches/*/plugin-resolution/
- rm -fr $HOME/.gradle/caches/*/scripts/

cache:
  directories:
    - $HOME/.gradle/caches/
    - $HOME/.gradle/wrapper/

install:
- sudo cp ./extras/ImageMagick/policy.xml /etc/ImageMagick-6/policy.xml

before_script:
- convert --version
# - convert -list policy
- tesseract --version
# - ./gradlew downloadDependencies > /dev/null

script:
- bash travis_gradle_build.sh
3 changes: 3 additions & 0 deletions CHANGELOG.txt
@@ -0,0 +1,3 @@
Release 0.1.0 -- 15 Aug 2019
---------------
* Initial stable version release
109 changes: 109 additions & 0 deletions Dockerfile
@@ -0,0 +1,109 @@
################################################################
#
# BUILD STEPS
#

################################
#
# JDK base
#
FROM adoptopenjdk/openjdk11:slim AS jdk-11-base

# freeze the versions of the Tesseract+ImageMagick for reproducibility
ENV TESSERACT_VERSION 4.00~git2288-10f4998a-2
ENV TESSERACT_RES_VERSION 4.00~git24-0e00fe6-1.2
ENV IMAGEMAGICK_VERSION 8:6.9.7.4+dfsg-16ubuntu6.7

RUN apt-get update && \
# apt-get dist-upgrade -y && \
# apt-get install -y tesseract-ocr && \
apt-get update && \
apt-get install -y software-properties-common && \
apt-get install -y tesseract-ocr=$TESSERACT_VERSION tesseract-ocr-eng=$TESSERACT_RES_VERSION tesseract-ocr-osd=$TESSERACT_RES_VERSION && \
### apt-get install -y tesseract-ocr-osd=3.04.00-1 tesseract-ocr-eng=3.04.00-1 tesseract-ocr=3.04.01-5 && \
apt-get install -y imagemagick=$IMAGEMAGICK_VERSION --fix-missing && \
apt-get install -y python3-pip && pip3 install numpy matplotlib scikit-image && \
apt-get clean autoclean && \
apt-get autoremove -y && \
rm -rf /var/lib/apt/lists/*


################################
#
# Tika Server Builder
#
FROM jdk-11-base AS service-builder

# setup the build environment
RUN mkdir -p /devel
WORKDIR /devel

COPY ./gradle/wrapper /devel/gradle/wrapper
COPY ./gradlew /devel/

RUN ./gradlew --version

COPY ./settings.gradle /devel/
COPY . /devel/

# build service
# TIP: uncomment the two lines below to both build the service
# and run the tests during the build
#COPY ./extras/ImageMagick/policy.xml /etc/ImageMagick-6/policy.xml
#RUN ./gradlew build --no-daemon

RUN ./gradlew bootJar --no-daemon



################################################################
#
# RUN STEPS
#

################################
#
# JRE base
#
FROM adoptopenjdk/openjdk11:jre AS jre-11-base

# freeze the versions of the Tesseract+ImageMagick for reproducibility
ENV TESSERACT_VERSION 4.00~git2288-10f4998a-2
ENV TESSERACT_RES_VERSION 4.00~git24-0e00fe6-1.2
ENV IMAGEMAGICK_VERSION 8:6.9.7.4+dfsg-16ubuntu6.7

RUN apt-get update && \
# apt-get dist-upgrade -y && \
# apt-get install -y tesseract-ocr && \
apt-get update && \
apt-get install -y software-properties-common && \
apt-get install -y tesseract-ocr=$TESSERACT_VERSION tesseract-ocr-eng=$TESSERACT_RES_VERSION tesseract-ocr-osd=$TESSERACT_RES_VERSION && \
### apt-get install -y tesseract-ocr-osd=3.04.00-1 tesseract-ocr-eng=3.04.00-1 tesseract-ocr=3.04.01-5 && \
apt-get install -y imagemagick=$IMAGEMAGICK_VERSION --fix-missing && \
apt-get install -y python3-pip && pip3 install numpy matplotlib scikit-image && \
apt-get clean autoclean && \
apt-get autoremove -y && \
rm -rf /var/lib/apt/lists/*


################################
#
# Tika Service
#
FROM jre-11-base AS service-runner

# setup env
RUN mkdir -p /app/config
WORKDIR /app

# copy tika-server artifacts
COPY --from=service-builder /devel/build/libs/service-*.jar ./
COPY --from=service-builder /devel/src/main/resources/application.yaml ./config/

COPY --from=service-builder /devel/scripts/run.sh ./

# copy external tools configuration files
COPY ./extras/ImageMagick/policy.xml /etc/ImageMagick-6/policy.xml

# entry point
CMD ["/bin/bash", "/app/run.sh"]
163 changes: 160 additions & 3 deletions README.md
@@ -1,5 +1,162 @@
# Introduction
Apache Tika running as a web service
This project implements Apache Tika running as a web service using Spring Boot. It exposes a REST API so that a client can send a document in binary format and receive back the extracted text. The supported document formats are those supported by Tika.

# Status
Work-in-progress ...
The key motivation for developing a custom wrapper over Tika, instead of using the already available [Tika server](https://cwiki.apache.org/confluence/display/tika/TikaJAXRS), is better control over the document parsers used (such as PDFParser, Tesseract OCR and the legacy parser taken from [CogStack-Pipeline](https://github.com/CogStack/CogStack-Pipeline)) and over the returned results and HTTP return codes.


# Building
To build the application, run in the main directory:

`./gradlew build`

The build artifacts will be placed in the `./build` directory.


The tests are run as part of the build; failed tests may indicate missing third-party dependencies (see below). To skip the tests and only build the application, run:

`./gradlew bootJar`


## Tests
To run the available tests, run:

`./gradlew test`

Please note that failed tests may signify missing third-party dependencies.


## Third-party dependencies
In the minimal setup, for proper text extraction Apache Tika requires the following applications to be present on the system:
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract),
- [ImageMagick](https://imagemagick.org),
- [Ghostscript](https://www.ghostscript.com/) (required by ImageMagick for documents conversion).

ImageMagick also requires its configuration file `policy.xml` to be overridden by the provided `extras/ImageMagick/policy.xml` (in order to increase the resources available for file processing and to override the [security policy](https://stackoverflow.com/questions/52703123/override-default-imagemagick-policy-xml) related to Ghostscript).

Moreover, to enable the additional image processing capabilities of Tesseract OCR, a few other dependencies need to be present on the system, such as a Python environment. Please see the provided `Dockerfile` for the full list.


# Running the application
The application can be run either as a standalone Java application or inside a Docker container. The application configuration can be changed in the `application.yaml` file. The default configuration file is embedded in the jar file, but a different one can be specified manually (see below).

Please note that the recommended way is to use the provided Docker image since a number of dependencies need to be satisfied on a local machine.


## Running as a standalone Java application
Assuming that the build succeeded, run the Tika service on a local machine with:

`java -jar build/libs/service-*.jar`

The running service will be listening on port `8090` (by default) on the host machine.


## Using the Docker image
The latest stable Docker image is available on Docker Hub under the `cogstacksystems/tika-service:latest` tag. Alternatively, the latest development version is available under the `cogstacksystems/tika-service:dev-latest` tag. The image can also be built locally using the provided `Dockerfile`.


To run the Tika service container:

`docker run -p 8090:8090 cogstacksystems/tika-service:latest`

The service will be listening on port `8090` on the host machine.


# API

## API specification
By default, Tika Service listens on port `8090` and returns the content extraction result in JSON format.

The service exposes the following endpoints:
- *GET* `/api/info` - returns information about the service with its configuration,
- *POST* `/api/process` - processes a binary data stream with the binary document content,
- *POST* `/api/process_file` - processes a document file (multi-part request).
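
As a rough illustration of the endpoints listed above, the following Python sketch builds requests for `/api/info` and `/api/process` using only the standard library. The base URL, the helper names and the `Content-Type` header are assumptions made for this example; they are not taken from the service documentation.

```python
import json
import urllib.request

# Assumed base URL: a local instance on the default port named in this README.
BASE_URL = "http://localhost:8090"


def info_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET request for the service information endpoint."""
    return urllib.request.Request(f"{base_url}/api/info", method="GET")


def process_request(document: bytes, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request that sends the raw document bytes as the body."""
    return urllib.request.Request(
        f"{base_url}/api/process",
        data=document,
        # Assumed header: the README does not state which content type is expected.
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response (needs a running service)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a service instance running, `send(info_request())` would return the decoded JSON document. Note that `urllib` raises `urllib.error.HTTPError` for HTTP error status codes, so callers that care about the returned status need to catch it.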

## Document extraction result
The extraction results are represented in JSON format where the available main fields are:
- `result` - the content extraction result with metadata,
- `timestamp` - the content processing timestamp,
- `success` - specifies whether the extraction completed successfully,
- `error` - the message in case of processing error (assumes `success : false`).

The content extraction result can contain the following fields:
- `text` - the extracted text,
- `metadata` - the metadata associated with the document and the used parsers.

The metadata associated with the document and the parsers used can include the following fields:
- `X-Parsed-By` - an array of names of the parsers used during the content extraction,
- `X-OCR-Applied` - a flag specifying whether OCR was applied,
- `Content-Type` - the content type of the document, as identified by Tika,
- `Page-Count` - the document page count (extracted from the document metadata by Tika),
- `Creation-Date` - the document creation date (extracted from the document metadata by Tika).


# Example use
Using `curl` to send a document to a Tika service instance running on localhost on port `8090`:

`curl -F [email protected] http://localhost:8090/api/process_file | jq`

Returned result:
```
{
  "result": {
    "text": "Sample Type / Medical Specialty: Lab Medicine - Pathology",
    "metadata": {
      "X-Parsed-By": [
        "org.apache.tika.parser.CompositeParser",
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"
      ],
      "X-OCR-Applied": "false",
      "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    },
    "success": true,
    "timestamp": "2019-08-13T15:14:58.022+01:00"
  }
}
```
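
One possible way to read such a response in Python is sketched below. The helper name `summarize_response` is hypothetical. The field list earlier in this README presents `success` and `timestamp` as main fields, while the sample response nests them inside `result`, so the sketch checks both places; this is a defensive assumption, not documented behaviour. It also treats `X-OCR-Applied` as a string flag, as it appears in the sample.

```python
from typing import Any, Dict


def summarize_response(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the commonly used fields of an extraction response."""
    result = payload.get("result") or {}
    metadata = result.get("metadata") or {}

    def lookup(key: str) -> Any:
        # Assumption: status fields may sit at the top level or inside `result`.
        if key in payload:
            return payload[key]
        return result.get(key)

    return {
        "success": bool(lookup("success")),
        "error": lookup("error"),
        "timestamp": lookup("timestamp"),
        "text": result.get("text", ""),
        # The sample response encodes the flag as the string "false".
        "ocr_applied": str(metadata.get("X-OCR-Applied", "false")).lower() == "true",
        "parsers": metadata.get("X-Parsed-By", []),
    }
```

Callers would typically check `summary["success"]` first and fall back to `summary["error"]` for the message when it is false.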

# Configuration

## Configuration file
All the available service and document processor parameters are stored in a single `src/main/resources/application.yaml` file.

Although the initial configuration file is bundled with the application jar file, a modified one can be provided as a parameter when running the Java application. For example, when running the Tika service in the Docker container, the script `scripts/run.sh` runs the Tika service with a custom configuration file `application.yaml` located in the `/app/config/` directory:

`java -Dspring.config.location=/app/config/ -jar /app/service-*.jar`


## Available properties
The configuration file is stored in yaml format with the following available properties.

### General application properties
- `application.version` - specifies the application version,
- `server.port` - the port number on which the service will be run (default: `8090`),
- `spring.servlet.multipart.max-file-size` and `spring.servlet.multipart.max-request-size` - specify the maximum file size and request size when processing file requests (default: `100MB`).


### Tika service configuration
The following keys reside under `tika.processing` node:
- `use-legacy-tika-processor-as-default` - whether to use the legacy Tika PDF parser (as used in CogStack Pipeline) for backward compatibility (default: `true`),
- `fail-on-empty-files` - whether to fail the request and report an error when the client provides an empty document (default: `false`),
- `fail-on-non-document-types` - whether to fail the request and report an error when the client provides unsupported and/or non-document content (default: `true`).


### Tika parsers configuration
The following keys reside under `tika.parsers` node.

The keys under `tesseract-ocr` define the default behavior of the Tika Tesseract OCR parser:
- `language` - the language dictionary used by Tesseract (default: `eng`),
- `timeout` - the max time (ms) to process documents before reporting error (default: `300`),
- `enable-image-processing` - whether to use additional pre-processing of the images using ImageMagick (default: `false`),
- `apply-rotation` - whether to apply de-rotating of the images (default: `false`).

Please note that enabling `enable-image-processing` and/or `apply-rotation` may improve the quality of the extracted text, but can significantly slow down the extraction process.

The keys under `pdf-ocr-parser` define the default behavior of the PDF parser that uses Tesseract OCR to extract the text:
- `ocr-only-strategy` - whether to use only OCR or to apply additional text extraction from the content (default: `true`),
- `min-doc-text-length` - if the available text in the document (before applying OCR) is higher than this value then skip OCR (default: `100`),
- `min-doc-byte-size` - the minimum size (in bytes) that image data must have for its content to be extracted; smaller image data is skipped (default: `10000`),
- `use-legacy-ocr-parser-for-single-page-doc` - in case of single-page PDF documents, whether to use the legacy parser (default: `false`).
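
Read literally, the `min-doc-text-length` rule amounts to a simple threshold check. The Python sketch below is an interpretation of the description above, not the service's actual code; the function name and the choice to measure the length after stripping surrounding whitespace are assumptions.

```python
def should_skip_ocr(embedded_text: str, min_doc_text_length: int = 100) -> bool:
    """Return True when the text already present in the PDF makes OCR unnecessary."""
    # Assumption: the length is measured after stripping surrounding whitespace.
    return len(embedded_text.strip()) > min_doc_text_length
```

Under this reading, a scanned PDF with no embedded text always goes through OCR, while a document with more than 100 characters of embedded text (with the default value) does not.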

The keys under `legacy-pdf-parser` define the behavior of the Tika PDF parser used in CogStack Pipeline (the 'legacy' parser), that is used for backward compatibility:
- `image-magick.timeout` - the max timeout value (in ms) when performing document conversion using ImageMagick (default: `300`),
- `tesseract-ocr.timeout` - the max timeout value (in ms) when performing text extraction using Tesseract OCR (default: `300`),
- `min-doc-text-length` - if the available text in the document (before applying OCR) is higher than this value then skip OCR (default: `100`).
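
To make the nesting of these properties easier to picture, the Python sketch below mirrors the documented defaults as a nested dictionary and resolves dotted keys such as `tika.parsers.tesseract-ocr.timeout`. The layout is inferred from the key names in this README rather than copied from the shipped `application.yaml`, and the helper names `get_property` and `with_override` are hypothetical.

```python
import copy
from typing import Any, Dict

# Defaults as listed in this README; the nesting is an inference from the key names.
DEFAULTS: Dict[str, Any] = {
    "server": {"port": 8090},
    "tika": {
        "processing": {
            "use-legacy-tika-processor-as-default": True,
            "fail-on-empty-files": False,
            "fail-on-non-document-types": True,
        },
        "parsers": {
            "tesseract-ocr": {
                "language": "eng",
                "timeout": 300,
                "enable-image-processing": False,
                "apply-rotation": False,
            },
            "pdf-ocr-parser": {
                "ocr-only-strategy": True,
                "min-doc-text-length": 100,
                "min-doc-byte-size": 10000,
                "use-legacy-ocr-parser-for-single-page-doc": False,
            },
            "legacy-pdf-parser": {
                "image-magick": {"timeout": 300},
                "tesseract-ocr": {"timeout": 300},
                "min-doc-text-length": 100,
            },
        },
    },
}


def get_property(config: Dict[str, Any], dotted_key: str) -> Any:
    """Resolve a dotted key by walking the nested dictionary."""
    node: Any = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


def with_override(config: Dict[str, Any], dotted_key: str, value: Any) -> Dict[str, Any]:
    """Return a copy of the configuration with one property replaced."""
    updated = copy.deepcopy(config)
    *parents, leaf = dotted_key.split(".")
    node = updated
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return updated
```

For example, `with_override(DEFAULTS, "tika.parsers.tesseract-ocr.apply-rotation", True)` yields a configuration with de-rotation enabled while leaving `DEFAULTS` untouched.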