Initial stable release - v. 0.1.0 (#2)

* initial commit [WIP]
* added original tika parser from cogstack pipeline as a baseline
* further works on a new parser, cleanup
* added Dockerfile to build tika server image
* working on config
* working on config
* config of legacy parser + minor refactoring
* minor refactoring + debugging on the PDF/OCR
* added use of legacy parser for single-page documents; updated dockerfile; minor refactoring
* added /info endpoint to display configuration
* minor refactor
* adding tests
* fixed a bug of not recognising properly X-OCR-Applied flag
* added more tests and test files; minor refactor
* added tests for composite processor; minor tests reorg
* adding tests for controller
* added more tests for controller (stream, multipart)
* added support for control over failing on empty / incorrect document types; moved all properties to one application.yaml (limitations of spring)
* Update README.md
* build: adding support for travis CI building
* buiid: fixing travis dependencies
* build: print more verbose info about failed tests
* build: print more verbose info about failed tests
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* build: hunting for failed tests error cause
* fixing TravisCI failed tests (ImageMagick policy)
* fixing TravisCI
* fixing travis
* another attempt to fix the travis build
* debugging failing travis
* debugging travis build -- imagemagick
* travis build script cleanup
* Update travis_gradle_build.sh
* added run.sh to run the service in the (updated) Dockerfile
* code cleanup + code documenting
* added test on handling empty files
* Update README.md
* Update README.md
* added inclusion of document processed timestamp
* minor renaming - for consistency
* fixing datetime json de/serialization
* added changelog
* Update README.md
* version bump in yaml config files
* Update README.md
* proper version bump --> 0.1.0
CogStack · Aug 15, 2019 · 56727da · 56727da
1 parent 97601e1
commit 56727da

Showing 59 changed files with 3,304 additions and 25 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,23 +1,6 @@
-# Compiled class file
-*.class
+.DS_Store
+.idea
+.gradle
 
-# Log file
-*.log
-
-# BlueJ files
-*.ctxt
-
-# Mobile Tools for Java (J2ME)
-.mtj.tmp/
-
-# Package Files #
-*.jar
-*.war
-*.nar
-*.ear
-*.zip
-*.tar.gz
-*.rar
-
-# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
-hs_err_pid*
+build
+out
diff --git a/.travis.yml b/.travis.yml
@@ -0,0 +1,49 @@
+dist: xenial
+
+language: java
+
+jdk:
+  - openjdk11
+
+env:
+  # limit the number of processing threads used by tesseract
+  - OMP_THREAD_LIMIT=1
+
+addons:
+  apt:
+    sources:
+    # tesseract-ocr >= 4.0 is not available in the standard Xenial / Trusty distro
+    - sourceline: 'ppa:alex-p/tesseract-ocr'
+    packages:
+      - tesseract-ocr
+      - tesseract-ocr-osd
+      - tesseract-ocr-eng
+      - imagemagick
+      - ghostscript
+      - libtesseract-dev
+      - libmagickcore-dev
+      - libmagickwand-dev
+      - libmagic-dev
+      - apache2-utils
+
+before_cache:
+  - rm -f  $HOME/.gradle/caches/modules-2/modules-2.lock
+  - rm -fr $HOME/.gradle/caches/*/plugin-resolution/
+  - rm -fr $HOME/.gradle/caches/*/scripts/
+
+cache:
+  directories:
+    - $HOME/.gradle/caches/
+    - $HOME/.gradle/wrapper/
+
+install:
+  - sudo cp ./extras/ImageMagick/policy.xml /etc/ImageMagick-6/policy.xml
+
+before_script:
+  - convert --version
+#  - convert -list policy
+  - tesseract --version
+#  - ./gradlew downloadDependencies > /dev/null
+
+script:
+  - bash travis_gradle_build.sh
diff --git a/CHANGELOG.txt b/CHANGELOG.txt
@@ -0,0 +1,3 @@
+Release 0.1.0 -- 15 Aug 2019
+---------------
+* Initial stable version release
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,109 @@
+################################################################
+#
+# BUILD STEPS
+#
+
+################################
+#
+# JDK base
+#
+FROM adoptopenjdk/openjdk11:slim AS jdk-11-base
+
+# freeze the versions of the Tesseract+ImageMagick for reproducibility
+ENV TESSERACT_VERSION 4.00~git2288-10f4998a-2
+ENV TESSERACT_RES_VERSION 4.00~git24-0e00fe6-1.2
+ENV IMAGEMAGICK_VERSION 8:6.9.7.4+dfsg-16ubuntu6.7
+
+RUN apt-get update && \
+#	apt-get dist-upgrade -y && \
+#	apt-get install -y tesseract-ocr && \
+    apt-get update && \
+    apt-get install -y software-properties-common && \
+	apt-get install -y tesseract-ocr=$TESSERACT_VERSION tesseract-ocr-eng=$TESSERACT_RES_VERSION tesseract-ocr-osd=$TESSERACT_RES_VERSION && \
+###	apt-get install -y tesseract-ocr-osd=3.04.00-1 tesseract-ocr-eng=3.04.00-1 tesseract-ocr=3.04.01-5 && \
+	apt-get install -y imagemagick=$IMAGEMAGICK_VERSION --fix-missing && \
+	apt-get install -y python3-pip && pip3 install numpy matplotlib scikit-image && \
+	apt-get clean autoclean && \
+    apt-get autoremove -y && \
+    rm -rf /var/lib/apt/lists/*
+
+
+################################
+#
+# Tika Server Builder
+#
+FROM jdk-11-base AS service-builder
+
+# setup the build environment
+RUN mkdir -p /devel
+WORKDIR /devel
+
+COPY ./gradle/wrapper /devel/gradle/wrapper
+COPY ./gradlew /devel/
+
+RUN ./gradlew --version
+
+COPY ./settings.gradle /devel/
+COPY . /devel/
+
+# build service
+# TIP: uncomment the two lines below to both build the service
+#      and run the tests during the build
+#COPY ./extras/ImageMagick/policy.xml /etc/ImageMagick-6/policy.xml
+#RUN ./gradlew build --no-daemon
+
+RUN ./gradlew bootJar --no-daemon
+
+
+
+################################################################
+#
+# RUN STEPS
+#
+
+################################
+#
+# JRE base
+#
+FROM adoptopenjdk/openjdk11:jre AS jre-11-base
+
+# freeze the versions of the Tesseract+ImageMagick for reproducibility
+ENV TESSERACT_VERSION 4.00~git2288-10f4998a-2
+ENV TESSERACT_RES_VERSION 4.00~git24-0e00fe6-1.2
+ENV IMAGEMAGICK_VERSION 8:6.9.7.4+dfsg-16ubuntu6.7
+
+RUN apt-get update && \
+#	apt-get dist-upgrade -y && \
+#	apt-get install -y tesseract-ocr && \
+    apt-get update && \
+    apt-get install -y software-properties-common && \
+	apt-get install -y tesseract-ocr=$TESSERACT_VERSION tesseract-ocr-eng=$TESSERACT_RES_VERSION tesseract-ocr-osd=$TESSERACT_RES_VERSION && \
+###	apt-get install -y tesseract-ocr-osd=3.04.00-1 tesseract-ocr-eng=3.04.00-1 tesseract-ocr=3.04.01-5 && \
+	apt-get install -y imagemagick=$IMAGEMAGICK_VERSION --fix-missing && \
+	apt-get install -y python3-pip && pip3 install numpy matplotlib scikit-image && \
+	apt-get clean autoclean && \
+    apt-get autoremove -y && \
+    rm -rf /var/lib/apt/lists/*
+
+
+################################
+#
+# Tika Service
+#
+FROM jre-11-base AS service-runner
+
+# setup env
+RUN mkdir -p /app/config
+WORKDIR /app
+
+# copy tika-server artifacts
+COPY --from=service-builder /devel/build/libs/service-*.jar ./
+COPY --from=service-builder /devel/src/main/resources/application.yaml ./config/
+
+COPY --from=service-builder /devel/scripts/run.sh ./
+
+# copy external tools configuration files
+COPY ./extras/ImageMagick/policy.xml /etc/ImageMagick-6/policy.xml
+
+# entry point
+CMD ["/bin/bash", "/app/run.sh"]
diff --git a/README.md b/README.md
@@ -1,5 +1,162 @@
 # Introduction
-Apache Tika running as a web service
+This project implements Apache Tika running as a web service using Spring Boot. It exposes a REST API so that a client can send a document in binary format and receive back the extracted text. The supported document formats are those supported by Tika.
 
-# Status
-Work-in-progress ...
+The key motivation for developing a custom wrapper around Tika, instead of using the already available [Tika server](https://cwiki.apache.org/confluence/display/tika/TikaJAXRS), is better control over the document parsers used (such as PDFParser, Tesseract OCR and the legacy parser taken from [CogStack-Pipeline](https://github.com/CogStack/CogStack-Pipeline)) and over the returned results, including HTTP return codes.
+
+
+# Building
+To build the application, run in the main directory:
+
+`./gradlew build`
+
+The build artifacts will be placed in the `./build` directory.
+
+
+The tests are run during the build, and failed tests may indicate missing third-party dependencies (see below). To skip the tests and only build the application, run:
+
+`./gradlew bootJar`
+
+
+## Tests
+To run the available tests, run:
+
+`./gradlew test`
+
+Please note that failed tests may signify missing third-party dependencies.
+
+
+## Third-party dependencies
+In the minimal setup, for proper text extraction Apache Tika requires the following applications to be present on the system:
+- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract),
+- [ImageMagick](https://imagemagick.org),
+- [Ghostscript](https://www.ghostscript.com/) (required by ImageMagick for document conversion).
+
+ImageMagick also requires its configuration file `policy.xml` to be overridden by the provided `extras/ImageMagick/policy.xml` (in order to increase the resources available for file processing and to override the [security policy](https://stackoverflow.com/questions/52703123/override-default-imagemagick-policy-xml) related to Ghostscript).
+
+Moreover, to enable the additional image processing capabilities of Tesseract OCR, a few other dependencies, such as a Python environment, need to be present on the system. Please see the provided `Dockerfile` for the full list.
+
+
+# Running the application
+The application can be run either as a standalone Java application or inside a Docker container. The application configuration can be changed in the `application.yaml` file. The default configuration file is embedded in the jar file, but a different one can be specified manually (see below).
+
+Please note that the recommended way is to use the provided Docker image since a number of dependencies need to be satisfied on a local machine.
+
+
+## Running as a standalone Java application
+Assuming that the build went correctly, to run the Tika service on a local machine:
+
+`java -jar build/libs/service-*.jar`
+
+The running service will be listening on port `8090` (by default) on the host machine. 
+
+
+## Using the Docker image
+The latest stable Docker image is available on Docker Hub under the `cogstacksystems/tika-service:latest` tag. Alternatively, the latest development version is available under the `cogstacksystems/tika-service:dev-latest` tag. The image can also be built locally using the provided `Dockerfile`.
+
+
+To run the Tika service container:
+
+`docker run -p 8090:8090 cogstacksystems/tika-service:latest`
+
+The service will be listening on port `8090` on the host machine.
+
+
+# API
+
+## API specification
+Tika Service, by default, will be listening on port `8090` and the returned content extraction result will be represented in JSON format. 
+
+The service exposes the following endpoints:
+- *GET* `/api/info` - returns information about the service with its configuration,
+- *POST* `/api/process` - processes a binary data stream with the binary document content,
+- *POST* `/api/process_file` - processes a document file (multi-part request).
+
+## Document extraction result
+The extraction results are represented in JSON format where the available main fields are:
+- `result` - the content extraction result with metadata,
+- `timestamp` - the content processing timestamp,
+- `success` - specifies whether the extraction completed successfully,
+- `error` - the message in case of processing error (assumes `success : false`).
+
+The content extraction result can contain the following fields:
+- `text` - the extracted text,
+- `metadata` - the metadata associated with the document and the used parsers.
+
+The provided metadata associated with the document and the used parsers can include the following fields:
+- `X-Parsed-By` - an array of names of the parsers used during the content extraction,
+- `X-OCR-Applied` - a flag specifying whether OCR was applied,
+- `Content-Type` - the content type of the document, as identified by Tika,
+- `Page-Count` - the document page count (extracted from the document metadata by Tika),
+- `Creation-Date` - the document creation date (extracted from the document metadata by Tika).
+
+
+# Example use
+Using `curl` to send a document to a Tika service instance running on localhost on port `8090`:
+
+`curl -F [email protected] http://localhost:8090/api/process_file | jq`
+
+Returned result:
+```
+{
+  "result": {
+    "text": "Sample Type / Medical Specialty: Lab Medicine - Pathology",
+    "metadata": {
+      "X-Parsed-By": [
+        "org.apache.tika.parser.CompositeParser",
+        "org.apache.tika.parser.DefaultParser",
+        "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"
+      ],
+      "X-OCR-Applied": "false",
+      "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
+    },
+    "success": true,
+    "timestamp": "2019-08-13T15:14:58.022+01:00"
+  }
+}
+```
+
+# Configuration
+
+## Configuration file
+All the available service and document processor parameters are stored in a single `src/main/resources/application.yaml` file.
+
+Although the initial configuration file is bundled with the application jar file, a modified one can be provided as a parameter when running the Java application. For example, when running the Tika service in the Docker container, the script `scripts/run.sh` runs the Tika service with a custom configuration file `application.yaml` located in the `/app/config/` directory:
+`java -Dspring.config.location=/app/config/ -jar /app/service-*.jar`
+
+
+## Available properties
+The configuration file is stored in yaml format with the following available properties.
+
+### General application properties
+- `application.version` - specifies the application version,
+- `server.port` - the port number on which the service will be run (default: `8090`),
+- `spring.servlet.multipart.max-file-size` and `spring.servlet.multipart.max-request-size` - specifies the max file size when processing file requests (default: `100MB`).
+
+
+### Tika service configuration
+The following keys reside under the `tika.processing` node:
+- `use-legacy-tika-processor-as-default` - whether to use the legacy Tika PDF parser (as used in CogStack Pipeline) for backward compatibility (default: `true`),
+- `fail-on-empty-files` - whether to fail the request and report an error when the client provides an empty document (default: `false`),
+- `fail-on-non-document-types` - whether to fail the request and report an error when the client provides unsupported and/or non-document content (default: `true`).
+
+
+### Tika parsers configuration
+The following keys reside under `tika.parsers` node.
+
+The keys under `tesseract-ocr` define the default behavior of the Tika Tesseract OCR parser:
+- `language` - the language dictionary used by Tesseract (default: `eng`),
+- `timeout` - the max time (ms) to process documents before reporting error (default: `300`),
+- `enable-image-processing` - whether to use additional pre-processing of the images using ImageMagick (default: `false`),
+- `apply-rotation` - whether to apply de-rotating of the images (default: `false`).
+Please note that enabling `enable-image-processing` and/or `apply-rotation` may improve the quality of the extracted text but can significantly slow down the extraction process.
+
+The keys under `pdf-ocr-parser` define the default behavior of the PDF parser that uses Tesseract OCR to extract the text:
+- `ocr-only-strategy` - whether to use only OCR or to apply additional text extraction from the content (default: `true`),
+- `min-doc-text-length` - if the length of the text available in the document (before applying OCR) is higher than this value, OCR is skipped (default: `100`),
+- `min-doc-byte-size` - the minimum size (in bytes) of the image data for its content to be extracted; smaller images are skipped (default: `10000`),
+- `use-legacy-ocr-parser-for-single-page-doc` - in case of single-page PDF documents, whether to use the legacy parser (default: `false`).
+
+The keys under `legacy-pdf-parser` define the behavior of the Tika PDF parser used in CogStack Pipeline (the 'legacy' parser), which is used for backward compatibility:
+- `image-magick.timeout` - the max timeout value (in ms) when performing document conversion using ImageMagick (default: `300`),
+- `tesseract-ocr.timeout` - the max timeout value (in ms) when performing text extraction using Tesseract OCR (default: `300`),
+- `min-doc-text-length` - if the length of the text available in the document (before applying OCR) is higher than this value, OCR is skipped (default: `100`).
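
As a note on the API described in the README above, the following is a minimal client-side sketch in Python. It is not part of this commit. It assumes the default port `8090`, assumes that `/api/process` accepts a raw body with a generic `application/octet-stream` content type, and uses hypothetical names (`ExtractionResult`, `build_process_request`, `parse_extraction_result`). Because the README sample nests `success` and `timestamp` inside `result` while the field list reads as if they were top-level, the parser accepts either placement.

```python
import json
import urllib.request
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Assumption: the service listens on the README's default port.
BASE_URL = "http://localhost:8090"


@dataclass
class ExtractionResult:
    """Hypothetical container for the fields listed in the README."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    success: Optional[bool] = None
    timestamp: Optional[str] = None
    error: Optional[str] = None

    @property
    def ocr_applied(self) -> bool:
        # The README sample encodes X-OCR-Applied as the string "false".
        return str(self.metadata.get("X-OCR-Applied", "false")).lower() == "true"


def build_process_request(document: bytes, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/process with the raw document bytes."""
    return urllib.request.Request(
        url=f"{base_url}/api/process",
        data=document,
        # Assumption: a generic binary content type is acceptable to the service.
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def parse_extraction_result(payload: str) -> ExtractionResult:
    """Parse the JSON returned by the service into an ExtractionResult."""
    body = json.loads(payload)
    result = body.get("result") or {}

    def pick(key: str) -> Any:
        # Look inside "result" first (as in the README sample), then top-level.
        return result.get(key, body.get(key))

    return ExtractionResult(
        text=result.get("text", ""),
        metadata=result.get("metadata") or {},
        success=pick("success"),
        timestamp=pick("timestamp"),
        error=pick("error"),
    )
```

To call a running service, the request could be sent with `urllib.request.urlopen(build_process_request(data))` and the decoded response body passed to `parse_extraction_result`. Keeping request construction separate from sending makes the sketch easy to check without a live server.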