Initial stable release - v. 0.1.0 (#2)
* initial commit [WIP]
* added original tika parser from cogstack pipeline as a baseline
* further works on a new parser, cleanup
* added Dockerfile to build tika server image
* working on config
* working on config
* config of legacy parser + minor refactoring
* minor refactoring + debugging on the PDF/OCR
* added use of legacy parser for single-page documents; updated dockerfile; minor refactoring
* added /info endpoint to display configuration
* minor refactor
* adding tests
* fixed a bug of not recognising properly X-OCR-Applied flag
* added more tests and test files; minor refactor
* added tests for composite processor; minor tests reorg
* adding tests for controller
* added more tests for controller (stream, multipart)
* added support for control over failing on empty / incorrect document types; moved all properties to one application.yaml (limitations of spring)
* Update README.md
* build: adding support for travis CI building
* buiid: fixing travis dependencies
* build: print more verbose info about failed tests
* build: print more verbose info about failed tests
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* fixing TravisCI failed tests (ImageMagick policy)
* fixing TravisCI
* fixing travis
* another attempt to fix the travis build
* debugging failing travis
* debugging travis build -- imagemagick
* travis build script cleanup
* Update travis_gradle_build.sh
* added run.sh to run the service in the (updated) Dockerfile
* code cleanup + code documenting
* added test on handling empty files
* Update README.md
* Update README.md
* added inclusion of document processed timestamp
* minor renaming - for consistency
* fixing datetime json de/serialization
* added changelog
* Update README.md
* version bump in yaml config files
* Update README.md
* proper version bump --> 0.1.0
Showing 59 changed files with 3,304 additions and 25 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,23 +1,6 @@ | ||
# Compiled class file | ||
*.class | ||
.DS_Store | ||
.idea | ||
.gradle | ||
|
||
# Log file | ||
*.log | ||
|
||
# BlueJ files | ||
*.ctxt | ||
|
||
# Mobile Tools for Java (J2ME) | ||
.mtj.tmp/ | ||
|
||
# Package Files # | ||
*.jar | ||
*.war | ||
*.nar | ||
*.ear | ||
*.zip | ||
*.tar.gz | ||
*.rar | ||
|
||
# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml | ||
hs_err_pid* | ||
build | ||
out |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
dist: xenial | ||
|
||
language: java | ||
|
||
jdk: | ||
- openjdk11 | ||
|
||
env: | ||
# limit the number of processing threads used by tesseract | ||
- OMP_THREAD_LIMIT=1 | ||
|
||
addons: | ||
apt: | ||
sources: | ||
# tesseract-ocr >= 4.0 is not available in the standard Xenial / Trusty distro | ||
- sourceline: 'ppa:alex-p/tesseract-ocr' | ||
packages: | ||
- tesseract-ocr | ||
- tesseract-ocr-osd | ||
- tesseract-ocr-eng | ||
- imagemagick | ||
- ghostscript | ||
- libtesseract-dev | ||
- libmagickcore-dev | ||
- libmagickwand-dev | ||
- libmagic-dev | ||
- apache2-utils | ||
|
||
before_cache: | ||
- rm -f $HOME/.gradle/caches/modules-2/modules-2.lock | ||
- rm -fr $HOME/.gradle/caches/*/plugin-resolution/ | ||
- rm -fr $HOME/.gradle/caches/*/scripts/ | ||
|
||
cache: | ||
directories: | ||
- $HOME/.gradle/caches/ | ||
- $HOME/.gradle/wrapper/ | ||
|
||
install: | ||
- sudo cp ./extras/ImageMagick/policy.xml /etc/ImageMagick-6/policy.xml | ||
|
||
before_script: | ||
- convert --version | ||
# - convert -list policy | ||
- tesseract --version | ||
# - ./gradlew downloadDependencies > /dev/null | ||
|
||
script: | ||
- bash travis_gradle_build.sh |
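The `OMP_THREAD_LIMIT=1` setting in the Travis configuration above limits the number of processing threads used by Tesseract. A minimal Python sketch of applying the same limit to a single child process outside CI follows; `build_ocr_env` is a hypothetical helper name and is not part of this commit.

```python
import os


def build_ocr_env(thread_limit=1, base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set.

    The input mapping is copied, so the caller's environment is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


# Illustrative use with an explicit base environment.
example_env = build_ocr_env(1, base_env={"PATH": "/usr/bin"})
print(example_env)
```

The resulting mapping can be passed as the `env` argument of `subprocess.run` when invoking `tesseract`.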
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
Release 0.1.0 -- 15 Aug 2019 | ||
--------------- | ||
* Initial stable version release |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
################################################################ | ||
# | ||
# BUILD STEPS | ||
# | ||
|
||
################################ | ||
# | ||
# JDK base | ||
# | ||
FROM adoptopenjdk/openjdk11:slim AS jdk-11-base | ||
|
||
# freeze the versions of the Tesseract+ImageMagick for reproducibility | ||
ENV TESSERACT_VERSION 4.00~git2288-10f4998a-2 | ||
ENV TESSERACT_RES_VERSION 4.00~git24-0e00fe6-1.2 | ||
ENV IMAGEMAGICK_VERSION 8:6.9.7.4+dfsg-16ubuntu6.7 | ||
|
||
RUN apt-get update && \ | ||
# apt-get dist-upgrade -y && \ | ||
# apt-get install -y tesseract-ocr && \ | ||
apt-get update && \ | ||
apt-get install -y software-properties-common && \ | ||
apt-get install -y tesseract-ocr=$TESSERACT_VERSION tesseract-ocr-eng=$TESSERACT_RES_VERSION tesseract-ocr-osd=$TESSERACT_RES_VERSION && \ | ||
### apt-get install -y tesseract-ocr-osd=3.04.00-1 tesseract-ocr-eng=3.04.00-1 tesseract-ocr=3.04.01-5 && \ | ||
apt-get install -y imagemagick=$IMAGEMAGICK_VERSION --fix-missing && \ | ||
apt-get install -y python3-pip && pip3 install numpy matplotlib scikit-image && \ | ||
apt-get clean autoclean && \ | ||
apt-get autoremove -y && \ | ||
rm -rf /var/lib/apt/lists/* | ||
|
||
|
||
################################ | ||
# | ||
# Tika Server Builder | ||
# | ||
FROM jdk-11-base AS service-builder | ||
|
||
# setup the build environment | ||
RUN mkdir -p /devel | ||
WORKDIR /devel | ||
|
||
COPY ./gradle/wrapper /devel/gradle/wrapper | ||
COPY ./gradlew /devel/ | ||
|
||
RUN ./gradlew --version | ||
|
||
COPY ./settings.gradle /devel/ | ||
COPY . /devel/ | ||
|
||
# build service | ||
# TIP: uncomment the two lines below to both build the service | ||
# and run the tests during the build | ||
#COPY ./extras/ImageMagick/policy.xml /etc/ImageMagick-6/policy.xml | ||
#RUN ./gradlew build --no-daemon | ||
|
||
RUN ./gradlew bootJar --no-daemon | ||
|
||
|
||
|
||
################################################################ | ||
# | ||
# RUN STEPS | ||
# | ||
|
||
################################ | ||
# | ||
# JRE base | ||
# | ||
FROM adoptopenjdk/openjdk11:jre AS jre-11-base | ||
|
||
# freeze the versions of the Tesseract+ImageMagick for reproducibility | ||
ENV TESSERACT_VERSION 4.00~git2288-10f4998a-2 | ||
ENV TESSERACT_RES_VERSION 4.00~git24-0e00fe6-1.2 | ||
ENV IMAGEMAGICK_VERSION 8:6.9.7.4+dfsg-16ubuntu6.7 | ||
|
||
RUN apt-get update && \ | ||
# apt-get dist-upgrade -y && \ | ||
# apt-get install -y tesseract-ocr && \ | ||
apt-get update && \ | ||
apt-get install -y software-properties-common && \ | ||
apt-get install -y tesseract-ocr=$TESSERACT_VERSION tesseract-ocr-eng=$TESSERACT_RES_VERSION tesseract-ocr-osd=$TESSERACT_RES_VERSION && \ | ||
### apt-get install -y tesseract-ocr-osd=3.04.00-1 tesseract-ocr-eng=3.04.00-1 tesseract-ocr=3.04.01-5 && \ | ||
apt-get install -y imagemagick=$IMAGEMAGICK_VERSION --fix-missing && \ | ||
apt-get install -y python3-pip && pip3 install numpy matplotlib scikit-image && \ | ||
apt-get clean autoclean && \ | ||
apt-get autoremove -y && \ | ||
rm -rf /var/lib/apt/lists/* | ||
|
||
|
||
################################ | ||
# | ||
# Tika Service | ||
# | ||
FROM jre-11-base AS service-runner | ||
|
||
# setup env | ||
RUN mkdir -p /app/config | ||
WORKDIR /app | ||
|
||
# copy tika-server artifacts | ||
COPY --from=service-builder /devel/build/libs/service-*.jar ./ | ||
COPY --from=service-builder /devel/src/main/resources/application.yaml ./config/ | ||
|
||
COPY --from=service-builder /devel/scripts/run.sh ./ | ||
|
||
# copy external tools configuration files | ||
COPY ./extras/ImageMagick/policy.xml /etc/ImageMagick-6/policy.xml | ||
|
||
# entry point | ||
CMD ["/bin/bash", "/app/run.sh"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,162 @@ | ||
# Introduction | ||
Apache Tika running as a web service | ||
This project implements Apache Tika running as a web service using Spring Boot. It exposes a REST API so that a client can send a document in binary format and receive back the extracted text. The supported document formats are those supported by Tika. | ||
|
||
# Status | ||
Work-in-progress ... | ||
The key motivation for developing a custom wrapper around Tika, instead of using the already available [Tika server](https://cwiki.apache.org/confluence/display/tika/TikaJAXRS), is better control over the document parsers used (such as PDFParser, Tesseract OCR and the legacy one taken from [CogStack-Pipeline](https://github.com/CogStack/CogStack-Pipeline)) and control over the returned results with HTTP return codes. | ||
|
||
|
||
# Building | ||
To build the application, run in the main directory: | ||
|
||
`./gradlew build` | ||
|
||
The build artifacts will be placed in the `./build` directory. | ||
|
||
|
||
The tests are run during the build, and failed tests can also indicate missing third-party dependencies (see below). To skip the tests and just build the application, run: | ||
|
||
`./gradlew bootJar`. | ||
|
||
|
||
## Tests | ||
To run the available tests, run: | ||
|
||
`./gradlew test` | ||
|
||
Please note that failed tests may signify missing third-party dependencies. | ||
|
||
|
||
## Third-party dependencies | ||
In the minimal setup, for proper text extraction Apache Tika requires the following applications to be present on the system: | ||
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract), | ||
- [ImageMagick](https://imagemagick.org), | ||
- [Ghostscript](https://www.ghostscript.com/) (required by ImageMagick for document conversion). | ||
|
||
ImageMagick also requires its configuration file `policy.xml` to be overridden by the provided `extras/ImageMagick/policy.xml` (in order to increase the available resources for file processing and to override the [security policy](https://stackoverflow.com/questions/52703123/override-default-imagemagick-policy-xml) related to Ghostscript). | ||
|
||
Moreover, in order to enable additional image processing capabilities of Tesseract OCR, a few other dependencies need to be present on the system, such as a Python environment. Please see the provided `Dockerfile` for the full list. | ||
|
||
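Because failed tests may simply mean that one of the tools listed above is missing, it can help to check for them before building. The following Python sketch reports which tools are not on `PATH`. The executable names are assumptions: `tesseract` and `convert` match the commands invoked in `.travis.yml`, and `gs` is the usual Ghostscript binary name. `missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed executable names for the three required applications.
REQUIRED_TOOLS = {
    "Tesseract OCR": "tesseract",
    "ImageMagick": "convert",
    "Ghostscript": "gs",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the sorted names of tools whose executable is not found on PATH."""
    return sorted(name for name, exe in tools.items() if which(exe) is None)


# Prints an empty list when all three tools are installed.
print(missing_tools())
```

This only checks that the executables exist. It does not verify versions or the ImageMagick `policy.xml` override.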
|
||
# Running the application | ||
The application can be run either as a standalone Java application or inside a Docker container. The application configuration can be changed in the `application.yaml` file. The default version of the configuration file is embedded in the jar file, but a different one can be specified manually (see below). | ||
|
||
Please note that the recommended way is to use the provided Docker image since a number of dependencies need to be satisfied on a local machine. | ||
|
||
|
||
## Running as a standalone Java application | ||
Assuming that the build went correctly, to run the Tika service on a local machine: | ||
|
||
`java -jar build/libs/service-*.jar` | ||
|
||
The running service will be listening on port `8090` (by default) on the host machine. | ||
|
||
|
||
## Using the Docker image | ||
The latest stable Docker image is available on Docker Hub under the `cogstacksystems/tika-service:latest` tag. Alternatively, the latest development version is available under the `cogstacksystems/tika-service:dev-latest` tag. The image can also be built locally using the provided `Dockerfile`. | ||
|
||
|
||
To run the Tika service container: | ||
|
||
`docker run -p 8090:8090 cogstacksystems/tika-service:latest` | ||
|
||
The service will be listening on port `8090` on the host machine. | ||
|
||
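After starting the container, a client may want to wait until the service accepts connections before sending documents. A minimal Python sketch, assuming the default port `8090` from the text above; `wait_for_port` is a hypothetical helper and only checks TCP reachability, not that the application has finished starting.

```python
import socket
import time


def wait_for_port(host="localhost", port=8090, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 8090, timeout=60)` could be called right after `docker run`.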
|
||
# API | ||
|
||
## API specification | ||
Tika Service, by default, will be listening on port `8090` and the returned content extraction result will be represented in JSON format. | ||
|
||
The service exposes the following endpoints: | ||
- *GET* `/api/info` - returns information about the service with its configuration, | ||
- *POST* `/api/process` - processes a binary data stream with the binary document content, | ||
- *POST* `/api/process_file` - processes a document file (multi-part request). | ||
|
||
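The three endpoints listed above can be called from any HTTP client. The Python sketch below only builds the requests, using the standard library, and does not send them. The `Content-Type` sent to `/api/process` and the multipart field name `file` are assumptions, since neither is specified here. All function names are hypothetical.

```python
import urllib.request
import uuid

BASE_URL = "http://localhost:8090"  # default port stated in the README


def info_request(base_url=BASE_URL):
    """GET /api/info: service information and configuration."""
    return urllib.request.Request(base_url + "/api/info", method="GET")


def process_request(document_bytes, base_url=BASE_URL):
    """POST /api/process: the raw document bytes are the request body."""
    return urllib.request.Request(
        base_url + "/api/process",
        data=document_bytes,
        headers={"Content-Type": "application/octet-stream"},  # assumption
        method="POST",
    )


def process_file_request(filename, document_bytes, field_name="file",
                         base_url=BASE_URL):
    """POST /api/process_file: multipart/form-data upload.

    The form field name ("file") is an assumption, not confirmed by the source.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        b"--" + boundary.encode() + b"\r\n",
        ('Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
         % (field_name, filename)).encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        document_bytes,
        b"\r\n--" + boundary.encode() + b"--\r\n",
    ])
    return urllib.request.Request(
        base_url + "/api/process_file",
        data=body,
        headers={"Content-Type": "multipart/form-data; boundary=" + boundary},
        method="POST",
    )
```

Any of these request objects can be sent with `urllib.request.urlopen(request)` once the service is running.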
## Document extraction result | ||
The extraction results are represented in JSON format where the available main fields are: | ||
- `result` - the content extraction result with metadata, | ||
- `timestamp` - the content processing timestamp, | ||
- `success` - specifies whether the extraction completed successfully, | ||
- `error` - the message in case of processing error (assumes `success : false`). | ||
|
||
The content extraction result can contain the following fields: | ||
- `text` - the extracted text, | ||
- `metadata` - the metadata associated with the document and the used parsers. | ||
|
||
The provided metadata associated with the document and the used parsers can include the following fields: | ||
- `X-Parsed-By` - an array of names of the parsers used during the content extraction, | ||
- `X-OCR-Applied` - a flag specifying whether OCR was applied, | ||
- `Content-Type` - the content type of the document, as identified by Tika, | ||
- `Page-Count` - the document page count (extracted from the document metadata by Tika), | ||
- `Creation-Date` - the document creation date (extracted from the document metadata by Tika). | ||
|
||
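A client consuming the metadata fields listed above has to interpret their types. In the README example, `X-OCR-Applied` is the string `"false"` rather than a JSON boolean, and `X-Parsed-By` is an array. The Python sketch below shows one tolerant way to read both. The helper names are hypothetical, and accepting a single string for `X-Parsed-By` is a defensive assumption.

```python
def ocr_was_applied(metadata):
    """Interpret the X-OCR-Applied flag, accepting a string or a boolean."""
    value = metadata.get("X-OCR-Applied", "false")
    return str(value).strip().lower() == "true"


def parsers_used(metadata):
    """Return the parser names as a list, even if a single string is given."""
    parsers = metadata.get("X-Parsed-By", [])
    return [parsers] if isinstance(parsers, str) else list(parsers)


# Metadata shaped like the example response in the README.
sample = {
    "X-Parsed-By": [
        "org.apache.tika.parser.CompositeParser",
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
    ],
    "X-OCR-Applied": "false",
}
print(ocr_was_applied(sample), len(parsers_used(sample)))
```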
|
||
# Example use | ||
Using `curl` to send a document to a Tika service instance running on localhost on port `8090`: | ||
|
||
`curl -F [email protected] http://localhost:8090/api/process_file | jq` | ||
|
||
Returned result: | ||
``` | ||
{ | ||
"result": { | ||
"text": "Sample Type / Medical Specialty: Lab Medicine - Pathology", | ||
"metadata": { | ||
"X-Parsed-By": [ | ||
"org.apache.tika.parser.CompositeParser", | ||
"org.apache.tika.parser.DefaultParser", | ||
"org.apache.tika.parser.microsoft.ooxml.OOXMLParser" | ||
], | ||
"X-OCR-Applied": "false", | ||
"Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" | ||
}, | ||
"success": true, | ||
"timestamp": "2019-08-13T15:14:58.022+01:00" | ||
} | ||
} | ||
``` | ||
|
||
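The example above places `success` and `timestamp` inside `result`, while the field list describes them as main fields. A client can handle both layouts. The Python sketch below is one hypothetical way to do so; `extract_text` is not part of the service, and the error handling is an assumption about how a client might react to `success : false`.

```python
import json


def extract_text(response_body):
    """Return the extracted text, or raise ValueError if extraction failed.

    Looks for `success` and `error` at the top level first and falls back
    to the nested `result` object.
    """
    payload = json.loads(response_body)
    result = payload.get("result") or {}
    success = payload.get("success", result.get("success", False))
    if not success:
        error = payload.get("error") or result.get("error") or "unknown error"
        raise ValueError("extraction failed: " + str(error))
    return result.get("text", "")


# Response shaped like the README example (metadata shortened).
example = json.dumps({
    "result": {
        "text": "Sample Type / Medical Specialty: Lab Medicine - Pathology",
        "metadata": {"X-OCR-Applied": "false"},
        "success": True,
        "timestamp": "2019-08-13T15:14:58.022+01:00",
    }
})
print(extract_text(example))
```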
# Configuration | ||
|
||
## Configuration file | ||
All the available service and document processor parameters are stored in a single `src/main/resources/application.yaml` file. | ||
|
||
Although the initial configuration file is bundled with the application jar file, a modified one can be provided as a parameter when running the Java application. For example, when running the Tika service in the Docker container, the script `scripts/run.sh` runs the Tika service with a custom configuration file `application.yaml` located in the `/app/config/` directory: | ||
`java -Dspring.config.location=/app/config/ -jar /app/service-*.jar` | ||
|
||
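The same launch command can be assembled programmatically, for instance by a wrapper script. The Python sketch below is hypothetical and is not what `scripts/run.sh` contains. It resolves the `service-*.jar` wildcard and makes sure the configuration directory ends with `/`, which Spring Boot expects when `spring.config.location` points to a directory.

```python
import glob
import os


def service_command(app_dir="/app", config_dir="/app/config/"):
    """Build the argument list for launching the service jar.

    Picks the last jar matching service-*.jar in app_dir (sorted by name).
    """
    jars = sorted(glob.glob(os.path.join(app_dir, "service-*.jar")))
    if not jars:
        raise FileNotFoundError("no service-*.jar found in " + app_dir)
    if not config_dir.endswith("/"):
        config_dir += "/"
    return ["java", "-Dspring.config.location=" + config_dir, "-jar", jars[-1]]
```

The returned list can be passed directly to `subprocess.run`.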
|
||
## Available properties | ||
The configuration file is stored in yaml format with the following available properties. | ||
|
||
### General application properties | ||
- `application.version` - specifies the application version, | ||
- `server.port` - the port number on which the service will be run (default: `8090`), | ||
- `spring.servlet.multipart.max-file-size` and `spring.servlet.multipart.max-request-size` - specifies the max file size when processing file requests (default: `100MB`). | ||
|
||
|
||
### Tika service configuration | ||
The following keys reside under `tika.processing` node: | ||
- `use-legacy-tika-processor-as-default` - whether to use the legacy Tika PDF parser (as used in CogStack Pipeline) for backward compatibility (default: `true`), | ||
- `fail-on-empty-files` - whether to fail the request and report an error when the client provides an empty document (default: `false`), | ||
- `fail-on-non-document-types` - whether to fail the request and report an error when the client provides unsupported and/or non-document content (default: `true`). | ||
|
||
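The two `fail-on-*` flags above can be read as a small decision rule. The Python sketch below illustrates that rule using the documented defaults. It is an interpretation of the descriptions, not the service's actual code, and `validate_document` and its error messages are hypothetical.

```python
def validate_document(data, is_document_type,
                      fail_on_empty_files=False,
                      fail_on_non_document_types=True):
    """Return an error message if the request should fail, otherwise None.

    `is_document_type` stands in for whatever content-type detection the
    service performs; how that detection works is not described here.
    """
    if len(data) == 0:
        return "empty document" if fail_on_empty_files else None
    if not is_document_type and fail_on_non_document_types:
        return "unsupported or non-document content"
    return None


# With the defaults, an empty file is accepted and non-document content fails.
print(validate_document(b"", True))
print(validate_document(b"\x00\x01", False))
```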
|
||
### Tika parsers configuration | ||
The following keys reside under `tika.parsers` node. | ||
|
||
The keys under `tesseract-ocr` define the default behavior of the Tika Tesseract OCR parser: | ||
- `language` - the language dictionary used by Tesseract (default: `eng`), | ||
- `timeout` - the max time (ms) to process documents before reporting error (default: `300`), | ||
- `enable-image-processing` - whether to use additional pre-processing of the images using ImageMagick (default: `false`), | ||
- `apply-rotation` - whether to apply de-rotating of the images (default: `false`). | ||
Please note that enabling `enable-image-processing` and/or `apply-rotation` might improve the quality of the extracted text, but can significantly slow down the extraction process. | ||
|
||
The keys under `pdf-ocr-parser` define the default behavior of the PDF parser that uses Tesseract OCR to extract the text: | ||
- `ocr-only-strategy` - whether to use only OCR or to apply additional text extraction from the content (default: `true`), | ||
- `min-doc-text-length` - if the length of the text available in the document (before applying OCR) exceeds this value, OCR is skipped (default: `100`), | ||
- `min-doc-byte-size` - the minimum size (in bytes) of the image data for its content to be extracted; smaller images are skipped (default: `10000`), | ||
- `use-legacy-ocr-parser-for-single-page-doc` - in case of single-page PDF documents, whether to use the legacy parser (default: `false`). | ||
|
||
The keys under `legacy-pdf-parser` define the behavior of the Tika PDF parser used in CogStack Pipeline (the 'legacy' parser), that is used for backward compatibility: | ||
- `image-magick.timeout` - the max timeout value (in ms) when performing document conversion using ImageMagick (default: `300`), | ||
- `tesseract-ocr.timeout` - the max timeout value (in ms) when performing text extraction using Tesseract OCR (default: `300`), | ||
- `min-doc-text-length` - if the length of the text available in the document (before applying OCR) exceeds this value, OCR is skipped (default: `100`). |
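The `min-doc-text-length` threshold described above amounts to a single comparison. A minimal Python sketch, assuming the length is measured in characters of the already extracted text and compared strictly; `should_skip_ocr` is a hypothetical name, and whether the service trims whitespace before measuring is not stated.

```python
def should_skip_ocr(embedded_text, min_doc_text_length=100):
    """Return True when the text found before OCR is longer than the threshold."""
    return len(embedded_text) > min_doc_text_length


# A document with ample embedded text does not need OCR; a scan does.
print(should_skip_ocr("x" * 500))
print(should_skip_ocr(""))
```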