Merge pull request #70 from jamesvillarrubia/update-to-nlm-2.9.2_v2
Update to nlm 2.9.2 v2
ansukla authored Jul 26, 2024
2 parents 465e6a1 + 39b1c1b commit b3608fe
Showing 4 changed files with 5 additions and 3 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -32,7 +32,7 @@ There are two ways to process these types of documents
1. Install latest version of java from https://www.oracle.com/java/technologies/downloads/
2. Run the tika server:
```
-java -jar <path_to_nlm_ingestor>/jars/tika-server-standard-nlm-modified-2.9.2_v1.jar
+java -jar <path_to_nlm_ingestor>/jars/tika-server-standard-nlm-modified-2.9.2_v2.jar
```
3. Install the ingestor
```
Binary file not shown.
4 changes: 3 additions & 1 deletion nlm_ingestor/file_parser/tika_parser.py
@@ -16,7 +16,9 @@ def parse_to_html(self, filepath, do_ocr=False):
# Turn off OCR by default
timeout = 3000
headers = {
-"X-Tika-OCRskipOcr": "true"
+"X-Tika-OCRskipOcr": "true",
+"X-Tika-PDFOcrStrategy": "auto",
+"X-Tika-PDFExtractFontNames": "true"
}
if do_ocr:
headers = {
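For context, the default (non-OCR) headers in the hunk above could be attached to a Tika server request roughly as follows. This is a minimal sketch, not the code in `tika_parser.py`: the URL `http://localhost:9998/tika`, the `Accept: text/html` header, and the helper names `default_tika_headers` and `build_request` are assumptions made for illustration. The `do_ocr` branch is cut off in the diff, so it is left out here.

```python
import urllib.request

# Assumption: a Tika server listening on its usual default port.
TIKA_URL = "http://localhost:9998/tika"


def default_tika_headers():
    """Return the non-OCR headers shown in the diff (hypothetical helper)."""
    return {
        "X-Tika-OCRskipOcr": "true",           # skip OCR by default
        "X-Tika-PDFOcrStrategy": "auto",       # added in this commit
        "X-Tika-PDFExtractFontNames": "true",  # added in this commit
    }


def build_request(data, url=TIKA_URL):
    """Build, but do not send, a PUT request carrying the document bytes."""
    headers = default_tika_headers()
    headers["Accept"] = "text/html"  # assumption: ask Tika for HTML output
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


if __name__ == "__main__":
    req = build_request(b"%PDF-1.4 placeholder bytes")
    print(req.get_method(), req.full_url)
    # To actually send it (requires a running server):
    # with urllib.request.urlopen(req, timeout=3000) as resp:
    #     html = resp.read().decode("utf-8")
```

Building the request separately from sending it makes the header set easy to inspect without a running server.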
2 changes: 1 addition & 1 deletion run.sh
@@ -1,4 +1,4 @@
#!/bin/bash
# latest version of java and a python environment where requirements are installed is required
-nohup java -jar jars/tika-server-standard-nlm-modified-2.9.2_v1.jar > /dev/null 2>&1 &
+nohup java -jar jars/tika-server-standard-nlm-modified-2.9.2_v2.jar > /dev/null 2>&1 &
python -m nlm_ingestor.ingestion_daemon
