diff --git a/tika/README.md b/tika/README.md
index 01d218c..ae1bb56 100644
--- a/tika/README.md
+++ b/tika/README.md
@@ -11,7 +11,7 @@ The Docker Image 'imixs/tika' provides a Tika Server. This server can be used fo
 
 ## The Rest API
 
-The Rest API is provided by the [Apache Tika Project](https://tika.apache.org/). You will find details about the API [here](https://wiki.apache.org/tika/TikaJAXRS).
+The Rest API is provided by the [Apache Tika Project](https://tika.apache.org/). You will find details about the API [here](https://cwiki.apache.org/confluence/display/TIKA/TikaServer).
 
 ### Get the Text of a Document
 
@@ -97,10 +97,30 @@ This is an example for a tika configuration with higher OCR resolution:
+
+
+## Using Header Parameters
+
+During an HTTP request it is also possible to pass header parameters through to the Tika Server. These header parameters are prefixed with *X-Tika-OCR* and *X-Tika-PDF*.
+
+    $ curl -T test/Dokument01.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"
+
+The code that handles the X-Tika-OCR and X-Tika-PDF headers is the method [TikaResource.processHeaderConfig](https://github.com/apache/tika/blob/0bf11aec86079b8f1ae2f1ea680910ba79665c4f/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L190).
+
+The header suffixes and values are mapped onto the [TesseractOCRConfig](https://tika.apache.org/1.24/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html) and [PDFParserConfig](https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.html) configuration objects via reflection. In this way you can set any config option with a corresponding header parameter.
+
+To see which X-Tika headers you can set, look up the options on the config class you want to adjust ([Tesseract](https://tika.apache.org/1.24/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html) or [PDF](https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.html)), build the header name from the prefix and the option name, and set the header. If you are not sure what an option does, or what values it takes, look at the JavaDocs for the underlying setter method that will be called.
+
+For example, the config method *[setExtractInlineImages](https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages-boolean-)* on PDF maps to the header parameter
+
+    X-Tika-PDFextractInlineImages
+
+**Note:** Header parameters are case sensitive!
+
 ## OCR Tesseract
 
-You can also confgure the OCR feature based on Tesseract. Find details [here](https://cwiki.apache.org/confluence/display/TIKA/TikaOCR).
+You can also configure the OCR feature based on Tesseract. Find details [here](https://cwiki.apache.org/confluence/display/TIKA/TikaOCR).
 
 # Contribute
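
Review note: the header-naming rule described in the added *Using Header Parameters* section can be sketched as a small shell helper. This is only an illustration of the README's own example (`setExtractInlineImages` becomes `X-Tika-PDFextractInlineImages`). The function name `tika_header_name` is hypothetical and is not part of Tika or the `imixs/tika` image, and lowercasing the first letter of the option simply follows the README's example.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tika): derive an X-Tika header name
# from a config setter name, following the README's example mapping.
#   $1 = prefix, "PDF" or "OCR"
#   $2 = setter name as listed in the JavaDocs, e.g. setExtractInlineImages
tika_header_name() {
    prefix=$1
    setter=$2
    # Drop the leading "set" from the setter name.
    option=${setter#set}
    # Lowercase the first character, as in the README's example header.
    first=$(printf '%s' "$option" | cut -c1 | tr '[:upper:]' '[:lower:]')
    rest=$(printf '%s' "$option" | cut -c2-)
    printf 'X-Tika-%s%s%s\n' "$prefix" "$first" "$rest"
}
```

Assuming a Tika Server is listening on `localhost:9998` as in the README's curl example, the helper could be combined with curl like this:

    $ curl -T test/Dokument01.pdf http://localhost:9998/tika --header "$(tika_header_name PDF setExtractInlineImages): true"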