Commit

docu
rsoika committed May 26, 2020
1 parent 4963de8 commit 7720f05
Showing 1 changed file with 22 additions and 2 deletions.
24 changes: 22 additions & 2 deletions tika/README.md
@@ -11,7 +11,7 @@ The Docker Image 'imixs/tika' provides a Tika Server. This server can be used fo

## The Rest API

The Rest API is provided by the [Apache Tika Project](https://tika.apache.org/). You will find details about the API [here](https://wiki.apache.org/tika/TikaJAXRS).
The Rest API is provided by the [Apache Tika Project](https://tika.apache.org/). You will find details about the API [here](https://cwiki.apache.org/confluence/display/TIKA/TikaServer).

### Get the Text of a Document

@@ -97,10 +97,30 @@ This is an example for a tika configuration with higher OCR resolution:
</parser>
</parsers>
</properties>


## Using Header Parameters

An HTTP request can also pass header parameters through to the Tika Server. These header parameters are prefixed with *X-Tika-OCR* and *X-Tika-PDF*.

$ curl -T test/Dokument01.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"
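
The same request can also be issued from a script. The following is a minimal Python sketch, assuming a Tika server listening on `localhost:9998` as in the curl call above. The function name `put_document`, the `Accept: text/plain` header, the 60 second timeout and the UTF-8 decoding are assumptions of this sketch and not something the image prescribes.

```python
import http.client
from typing import Dict, Optional


def put_document(data: bytes,
                 headers: Optional[Dict[str, str]] = None,
                 host: str = "localhost",
                 port: int = 9998) -> str:
    """PUT raw document bytes to /tika and return the response body as text.

    `headers` carries optional X-Tika-* parameters, for example
    {"X-Tika-PDFOcrStrategy": "ocr_only"}.
    """
    # Assumption: ask for plain text; caller-supplied headers override it.
    request_headers = {"Accept": "text/plain"}
    request_headers.update(headers or {})

    conn = http.client.HTTPConnection(host, port, timeout=60)
    try:
        conn.request("PUT", "/tika", body=data, headers=request_headers)
        response = conn.getresponse()
        body = response.read()
        if response.status != 200:
            raise RuntimeError(f"Tika returned HTTP {response.status}")
        # Assumption: the server answers in UTF-8.
        return body.decode("utf-8")
    finally:
        conn.close()


# Example call (needs a running server and the sample file):
#   with open("test/Dokument01.pdf", "rb") as f:
#       print(put_document(f.read(), {"X-Tika-PDFOcrStrategy": "ocr_only"}))
```

`http.client` is used here because it sends header names exactly as given, whereas `urllib.request.Request` rewrites them with `str.capitalize()`, which would change the casing of the `X-Tika-` parameters.
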

The X-Tika-OCR and X-Tika-PDF headers are handled by the method [TikaResource.processHeaderConfig](https://github.com/apache/tika/blob/0bf11aec86079b8f1ae2f1ea680910ba79665c4f/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L190).

The header suffixes and their values are mapped onto the [TesseractOCRConfig](https://tika.apache.org/1.24/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html) and [PDFParserConfig](https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.html) configuration objects via reflection. This means any config option can be set with a corresponding header parameter.
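
To make the idea of a reflection-based mapping more concrete, here is a toy model in Python. It is not Tika's code: `ToyPdfConfig` and `apply_headers` are hypothetical names, the class is only a stand-in for a config object with setter methods, and the string-to-boolean conversion is a simplifying assumption.

```python
import inspect


class ToyPdfConfig:
    """Hypothetical stand-in for a config object; not Tika's PDFParserConfig."""

    def __init__(self):
        self.extract_inline_images = False
        self.ocr_strategy = "no_ocr"

    def setExtractInlineImages(self, value: bool):
        self.extract_inline_images = value

    def setOcrStrategy(self, value: str):
        self.ocr_strategy = value


def apply_headers(config, headers, prefix):
    """Call the setter matching each header that starts with `prefix`."""
    for name, raw in headers.items():
        suffix = name[len(prefix):] if name.startswith(prefix) else ""
        if not suffix:
            continue  # not a parameter for this config object
        # Look the setter up by name at runtime, e.g. "setOcrStrategy".
        setter = getattr(config, "set" + suffix[0].upper() + suffix[1:], None)
        if setter is None:
            raise ValueError(f"no config option for header {name!r}")
        # Header values arrive as strings; convert according to the setter's
        # declared parameter type (only bool and str in this toy model).
        param = next(iter(inspect.signature(setter).parameters.values()))
        setter(raw.lower() == "true" if param.annotation is bool else raw)


config = ToyPdfConfig()
apply_headers(config,
              {"X-Tika-PDFextractInlineImages": "true",
               "X-Tika-PDFOcrStrategy": "ocr_only",
               "Accept": "text/plain"},
              prefix="X-Tika-PDF")
print(config.extract_inline_images, config.ocr_strategy)
```

Because the setter is looked up by name, no per-option code is needed: adding a setter to the config class is enough to make a new header parameter available in this model.
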

To see which X-Tika headers you can set, look up the options on the config class you want to adjust ([Tesseract](https://tika.apache.org/1.24/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html) or [PDF](https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.html)), build the header name from the prefix and the option name, and set that header on the request. If you are not sure what an option does or which values it accepts, check the JavaDoc of the underlying setter method that will be called.

For example, the config method *[setExtractInlineImages](https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages-boolean-)* of the PDF config maps to the header parameter:

X-Tika-PDFextractInlineImages

**Note:** Header parameters are case sensitive!
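
Building the header name can be written down as a small helper. This is a sketch only: `header_name` and the `PREFIXES` table are hypothetical names, and the casing rule (drop `set`, lowercase the first letter of the option) simply mirrors the `setExtractInlineImages` example above. The curl example earlier uses a capitalized suffix (`OcrStrategy`), so check the exact casing against `processHeaderConfig` for the Tika version in use.

```python
# The two prefixes named in this README, keyed by a short label (hypothetical).
PREFIXES = {"pdf": "X-Tika-PDF", "ocr": "X-Tika-OCR"}


def header_name(config: str, setter: str) -> str:
    """Derive a header name from a setter name, e.g. 'setExtractInlineImages'."""
    if not setter.startswith("set") or len(setter) == 3:
        raise ValueError(f"not a setter name: {setter!r}")
    option = setter[3:]
    # Assumption: follow the README example and lowercase the first letter.
    return PREFIXES[config] + option[0].lower() + option[1:]


print(header_name("pdf", "setExtractInlineImages"))
# X-Tika-PDFextractInlineImages
```
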


## OCR Tesseract

You can also confgure the OCR feature based on Tesseract. Find details [here](https://cwiki.apache.org/confluence/display/TIKA/TikaOCR).
You can also configure the OCR feature based on Tesseract. Find details [here](https://cwiki.apache.org/confluence/display/TIKA/TikaOCR).


# Contribute