Text Reading: Setup

Text Reading

NOTE: Running with Java 11 may cause errors related to the Stanford NLP utilities. If you hit such errors, downgrade to Java 8.

To install Java 8 on a Mac, either download the installer from https://adoptopenjdk.net/?variant=openjdk8&jvmVariant=hotspot or install it with Homebrew:

brew tap AdoptOpenJDK/openjdk
brew install --cask adoptopenjdk8
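
After installing, it can help to confirm which Java version is actually picked up from the PATH. The following is a minimal sketch, not part of the project: java_major and installed_java_version are hypothetical helper names, and the parsing assumes the usual version strings (for example 1.8.0_292 for Java 8 and 11.0.11 for Java 11).

```shell
# Hypothetical helper: reduce a Java version string to its major version.
# Java 8 and earlier report "1.<major>.<...>"; Java 9+ report "<major>.<...>".
java_major() {
  case "$1" in
    1.*) rest=${1#1.}; echo "${rest%%.*}" ;;
    *)   echo "${1%%.*}" ;;
  esac
}

# Hypothetical helper: extract the quoted version from `java -version`,
# which writes to stderr, e.g.: openjdk version "1.8.0_292"
installed_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}

# Only query Java if a java executable is on the PATH.
if command -v java >/dev/null 2>&1; then
  echo "Java major version: $(java_major "$(installed_java_version)")"
fi
```

If the reported major version is 11 or higher and you see the Stanford NLP errors mentioned in the note, switching to Java 8 is the suggested fix.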

Dependencies:

Regextools:

Clone the repo (https://github.com/lum-ai/regextools). To do this over HTTPS, run the following command from the terminal:

git clone https://github.com/lum-ai/regextools.git

Switch to the repo you just cloned:

cd regextools

Publish the tool to your local Ivy repository so that this project can resolve it as a dependency:

sbt publishLocal

Python dependencies:

1. Python 3.x.x
2. SPARQLWrapper:

   pip install SPARQLWrapper

3. numpy
4. tqdm
5. lxml
6. webcolors
7. pdf2image
8. pillow
9. pdfminer.six

NOTE: Dependencies 4-9 (tqdm through pdfminer.six) are required by the pdfalign methods and are described on the pdfalign setup page and in the pdfalign README.
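
To see which of the packages listed above are still missing from your environment, a check along the following lines may help. This is a sketch under assumptions: missing_modules is a hypothetical helper, python3 is assumed to be the interpreter you use for the project, and the names passed to it are import names rather than package names (pillow is imported as PIL, pdfminer.six as pdfminer).

```shell
# Hypothetical helper: print every module name that the given interpreter
# cannot import. Usage: missing_modules <python> <module>...
missing_modules() {
  py=$1
  shift
  for mod in "$@"; do
    "$py" -c "import $mod" >/dev/null 2>&1 || echo "$mod"
  done
}

# Import names for the packages listed above; pillow is imported as PIL
# and pdfminer.six as pdfminer.
if command -v python3 >/dev/null 2>&1; then
  missing_modules python3 SPARQLWrapper numpy tqdm lxml webcolors pdf2image PIL pdfminer
fi
```

Anything the check prints can then be installed with pip in the same way as SPARQLWrapper, using the package name from the list (for example pillow rather than PIL).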

#TODO: there may be more python dependencies that have to be installed

Editing the config file:

For the text reading and alignment pipeline to run successfully, you need to set several fields in the configuration file (application.conf). Update these fields with the values and paths for your machine:

ReaderType - the type of engine to use (see the list of available engines).

<ReaderType>.preprocessorType - how much cleaning to do on the input text. Every available reader, e.g., TextEngine, has this setting, so make sure to change it for the reader you are actually using.

grounding.sparqlDir - path to the directory containing the code required to ground Mentions to the Scientific Variable Ontology using SPARQLWrapper.

alignment.w2vPath - path to the word embedding file (see the Word vectors section below).

apps.inputType - input file extension, e.g., json.

You can update the other paths in the config file as needed, e.g., apps.inputDirectory and apps.outputDirectory if you wish to use any of the apps.
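
To find where these fields currently sit in your copy of the config file, you could list the matching lines before editing. This is only a sketch: show_conf_fields is a hypothetical helper, and it matches on the last segment of each field name because the file may write a field either in dotted form or nested inside a block.

```shell
# Hypothetical helper: list, with line numbers, the lines of a config file
# that mention one of the fields described above.
# Usage: show_conf_fields <path_to_application.conf>
show_conf_fields() {
  grep -nE 'ReaderType|preprocessorType|sparqlDir|w2vPath|inputType|inputDirectory|outputDirectory' "$1"
}
```

Call it as show_conf_fields <path_to_application.conf>, then edit the reported lines with the values for your machine.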

Word vectors

For some of the functionality of the text reading and alignment pipeline, you will need to use pretrained word embeddings. You can use your own or use pretrained embeddings available for download, e.g., GloVe vectors.

To prepare the vectors for use with our pipeline, prepend a header line in the format <num_of_tokens> <size_of_embedding> (the two values separated by a single space), e.g., 6B 50. Embedding files are often too large for many editors to open, so consider using Vim to add the header to your embedding file.
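
As an alternative to opening the file in an editor, the header can be added from the command line. The sketch below is not part of the project: add_embedding_header is a hypothetical helper, and it assumes that each line holds one token followed by its space-separated vector values, so that the number of lines is taken as the number of tokens and the field count minus one as the embedding size.

```shell
# Hypothetical helper: write a copy of an embedding file with a
# "<num_of_tokens> <size_of_embedding>" header line prepended.
# Assumes one token per line followed by its space-separated vector values.
# Usage: add_embedding_header <input_file> <output_file>
add_embedding_header() {
  src=$1
  dst=$2
  # Number of tokens = number of lines in the file.
  num_tokens=$(awk 'END { print NR }' "$src")
  # Embedding size = fields on the first line, minus the token itself.
  emb_size=$(awk 'NR == 1 { print NF - 1; exit }' "$src")
  { printf '%s %s\n' "$num_tokens" "$emb_size"; cat "$src"; } > "$dst"
}
```

Call it as add_embedding_header <input_file> <output_file>. The output file must be different from the input file, since the input is still being read while the output is written.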

Note: using different word embeddings may slightly change the outputs of the alignment pipeline.

Setting python path

For some apps and scripts to run properly, you will need to set PYTHONPATH as a shell environment variable, using the following command:

export PYTHONPATH="<path_to_automates_root_directory>:${PYTHONPATH}"
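
To confirm that the variable is set as intended in the current shell, a small check like the following may be useful. It is a sketch only: on_pythonpath is a hypothetical helper that tests whether a directory appears as one of the colon-separated entries of PYTHONPATH.

```shell
# Hypothetical helper: succeed if the given directory is one of the
# colon-separated entries of PYTHONPATH.
# Usage: on_pythonpath <directory>
on_pythonpath() {
  case ":${PYTHONPATH:-}:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

For example, on_pythonpath "<path_to_automates_root_directory>" && echo "PYTHONPATH is set" prints the message only when the export has taken effect. The export applies to the current shell session only; to make it permanent, add the export line to your shell startup file (for example ~/.bashrc or ~/.zshrc).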
