Text Reading: Setup
NOTE: running with Java 11 may cause errors related to Stanford NLP utils. In that case, downgrade to Java 8 (see here for discussion).
Download Java 8 from https://adoptopenjdk.net/?variant=openjdk8&jvmVariant=hotspot or install it with Homebrew:
> brew tap AdoptOpenJDK/openjdk
> brew install --cask adoptopenjdk8
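To confirm which Java version is active before running the pipeline, a small check along these lines may help. This is only a sketch: the function names `java_major_version` and `installed_java_major` are illustrative and not part of the project, and the parsing assumes the usual `version "..."` string that `java -version` prints.

```python
import re
import subprocess


def java_major_version(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a version string in the output")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x (e.g. "1.8.0_292").
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major() -> int:
    """Run `java -version` and return the major version of the active JVM."""
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return java_major_version(result.stderr)
```

Calling `installed_java_major()` should return 8 once the downgrade has taken effect; any other value suggests that a different JDK is still first on the PATH.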
-
Clone the repo https://github.com/lum-ai/regextools. To do this using HTTPS, run the following command from the terminal:
git clone https://github.com/lum-ai/regextools.git
-
Switch to the repo you just cloned:
cd regextools
-
Run the following command to publish the tool to your local repository so that the pipeline can import it:
sbt publishLocal
-
Python 3.x.x
-
pip install SPARQLWrapper
-
numpy
-
tqdm
-
lxml
-
webcolors
-
pdf2image
-
pillow
-
pdfminer.six
NOTE: Dependencies 4-9 (tqdm through pdfminer.six, counting Python as 1) are required for the pdfalign methods to work properly; they are described in the pdfalign setup page or the pdfalign readme page.
#TODO: there may be more Python dependencies that have to be installed.
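To see which of the Python packages listed above are still missing from the current environment, a check like the following can be run. It is a minimal sketch: the `REQUIRED` mapping and the `missing_packages` function are illustrative names, and the list only covers the packages named on this page, so it will not catch any additional dependencies.

```python
import importlib.util

# pip package name -> module name used in `import` statements.
# The two differ for pillow (PIL) and pdfminer.six (pdfminer).
REQUIRED = {
    "SPARQLWrapper": "SPARQLWrapper",
    "numpy": "numpy",
    "tqdm": "tqdm",
    "lxml": "lxml",
    "webcolors": "webcolors",
    "pdf2image": "pdf2image",
    "pillow": "PIL",
    "pdfminer.six": "pdfminer",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]
```

For example, `print(missing_packages())` lists the names to pass to `pip install`; an empty list means every package named here is importable.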
For the text reading and alignment pipeline to run successfully, you need to specify several paths in the configuration file (application.conf). Update these fields with the paths on your machine:
ReaderType - choose the type of engine to use (see the available Engines).
<ReaderType>.preprocessorType - choose how much cleaning to do on the input text. This argument is present in every available reader, e.g., TextEngine, so make sure to change it for the reader being used.
grounding.sparqlDir - path to the directory containing the code required to ground Mentions to the Scientific Variable Ontology using SPARQLWrapper.
alignment.w2vPath - path to the word embedding file (see details in #w2vec).
apps.inputType - input file extension, e.g., json.
You can update the other paths in the config file as needed, e.g., update apps.inputDirectory and apps.outputDirectory if you wish to use any of the apps here.
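Before launching the pipeline, it can be useful to confirm that the paths entered in application.conf actually exist. The sketch below does not parse application.conf; it assumes you copy the values by hand into a dictionary keyed by the field names above. The function name `missing_paths` and the placeholder paths in the comment are hypothetical.

```python
from pathlib import Path


def missing_paths(settings):
    """Return the config keys whose configured path does not exist on disk."""
    return [key for key, value in settings.items() if not Path(value).exists()]


# Example call (the paths are placeholders; replace them with your own values):
# missing_paths({
#     "grounding.sparqlDir": "<path_to_sparql_dir>",
#     "alignment.w2vPath": "<path_to_embedding_file>",
#     "apps.inputDirectory": "<path_to_input_dir>",
# })
```

Any key returned by `missing_paths` points at a path that should be corrected in the config file.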
For some of the functionality of the text reading and alignment pipeline, you will need to use pretrained word embeddings. You can use your own or use pretrained embeddings available for download, e.g., GloVe vectors.
To prepare the vectors for use with our pipeline, add a header line at the top of the file in the format <num_of_tokens><\s><size_of_embedding>, e.g., 6B 50. Embedding files are large and can be hard to open in many editors, so consider using Vim to add the header to your embedding file.
Note: using different word embeddings may slightly change the outputs of the alignment pipeline.
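As an alternative to editing a very large file by hand, the header can be added with a short script that streams the file instead of loading it into memory. This is a sketch under assumptions: the function names are illustrative, the input is assumed to be a GloVe-style text file with one token followed by its vector values per line, and `embedding_shape` treats the token count as the number of lines, which is one possible reading of `<num_of_tokens>`. Since `write_with_header` accepts any values, the header from the example above can also be passed explicitly.

```python
import shutil


def embedding_shape(path):
    """Return (number of lines, vector size) for a whitespace-separated vector file."""
    num_tokens, dim = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if num_tokens == 0:
                # The first field is the token; the remaining fields are the vector.
                dim = len(line.split()) - 1
            num_tokens += 1
    return num_tokens, dim


def write_with_header(src, dst, num_tokens, dim):
    """Copy src to dst, writing '<num_tokens> <dim>' as the first line."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        fout.write(f"{num_tokens} {dim}\n")
        # Copied in chunks, so the whole file is never held in memory.
        shutil.copyfileobj(fin, fout)
```

Writing to a new file rather than editing in place keeps the original download intact, so the step can be repeated if the header turns out to need a different value.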
For some apps and scripts to run properly, you will need to set PYTHONPATH as a shell environment variable, using the following command:
export PYTHONPATH="<path_to_automates_root_directory>:${PYTHONPATH}"
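To verify that the variable is visible to Python after the export, a check like the one below can be used. It is a minimal sketch; `on_pythonpath` is an illustrative name, and the directory passed in is whatever you used for the automates root.

```python
import os


def on_pythonpath(directory, env=os.environ):
    """Return True if directory is one of the PYTHONPATH entries in env."""
    entries = env.get("PYTHONPATH", "").split(os.pathsep)
    target = os.path.abspath(directory)
    # Empty entries (e.g. from a trailing separator) are ignored.
    return any(entry and os.path.abspath(entry) == target for entry in entries)
```

If `on_pythonpath("<path_to_automates_root_directory>")` returns False in a fresh terminal, the export was probably made in a different shell session and may need to be added to your shell startup file.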