
Text Reading: Usage


Shortcuts:

To make sure everything runs properly and you know what is available, you may want to read the full page, but here are some key parts:

  1. Preparing input
    1. Cosmos
    2. Science Parse
    3. Markdown/Plain text
  2. Mention Extraction
    1. Scala App
    2. Webservice endpoint
    3. PDF -> mentions using Docker (not recently tested)
  3. Alignment of papers and source code
  4. Aligning model papers and source code documentation (markdowns)
  5. Visualizer webapp
  6. Grounding descriptions to wikidata

Input Types

Currently, we support reading from the following input types: json (Science Parse and COSMOS), markdown, and plain text. There is a file loader for each type, as the input files are read slightly differently depending on the input type. Loaders are located here. For some of the applications and scripts to run properly (e.g., ExtractAndExport), the input type (json, md, or txt) will need to be defined in application.conf here.

The type of reader may also need to be adjusted depending on the type of input (e.g., for markdown files, switch the ReaderType here to MarkdownEngine).

Loaders

Loaders load and preprocess input files depending on their format. For Cosmos input files, we exclude sections that we do not expect to contain prose from which mentions can be extracted (see here) and combine blocks that look like they could belong to the same paragraph (see here). For markdown files, we split the file on new lines to produce chunks that are more easily parsed and remove backticks, as they interfere with the syntactic dependency parser (see here).
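As an illustration only, here is a minimal Python sketch of the markdown preprocessing described above (splitting on new lines and removing backticks); this is not the project's Scala loader code, and the file name is a placeholder.

def preprocess_markdown(text):
    """Split markdown text on new lines and strip backticks, which interfere with the dependency parser."""
    chunks = [line.strip() for line in text.splitlines() if line.strip()]
    return [chunk.replace("`", "") for chunk in chunks]

with open("model_docs.md") as f:    # placeholder file name
    for chunk in preprocess_markdown(f.read()):
        print(chunk)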

Cosmos

PDF files can be converted to text using the Cosmos Engine. The lab has an instance of the Cosmos pipeline set up, which is described here.

The pipeline produces parquet files, which need to be further processed before they can be used as input for the text reading pipeline.

We first convert the .parquet files to a .json file by running this script with the directory containing the .parquet files as an argument:

python cosmos_integration.py <input_dir>

The output file will end in --COSMOS-data.json. It is important to keep this part of the file name because some of the components of the text reading pipeline will process files differently based on whether or not this part of the file name is present.
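Roughly, the conversion gathers the records from the parquet files into a single json file. As a hedged illustration only (this is not the contents of cosmos_integration.py, and the column handling and output file name are assumptions), the core of such a conversion could look like:

import glob
import json
import os
import sys

import pandas as pd

input_dir = sys.argv[1]
frames = [pd.read_parquet(path) for path in sorted(glob.glob(os.path.join(input_dir, "*.parquet")))]
records = pd.concat(frames).to_dict(orient="records")

# Keep the --COSMOS-data suffix: downstream components key off this part of the file name.
with open("paper--COSMOS-data.json", "w") as out:
    json.dump(records, out, ensure_ascii=False, indent=2, default=str)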

We then address some of the encoding issues using this script, which is based on the ftfy package. This does not fix all the known encoding issues; some more are fixed while the file is loaded by the text reading pipeline. This step is not required, but it can improve the quality of extractions.
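The fix essentially amounts to running ftfy.fix_text over the text fields. A minimal hedged sketch (assuming the json is a list of block objects and that the text lives in a "content" field, which is an assumption about the Cosmos json schema):

import json

import ftfy

with open("paper--COSMOS-data.json") as f:
    blocks = json.load(f)

for block in blocks:
    if isinstance(block.get("content"), str):    # assumed field name
        block["content"] = ftfy.fix_text(block["content"])

# Overwrite in place so the --COSMOS-data suffix is preserved.
with open("paper--COSMOS-data.json", "w") as f:
    json.dump(blocks, f, ensure_ascii=False, indent=2)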

Science Parse

Science Parse produces a document with the full text of the publication along with some metadata, e.g., the year of publication, the list of authors, the document ID, etc. Currently, we extract events from three of these: the abstract, section headings, and section bodies. In the future, we hope to find a table reading tool: tables present a lot of information in a concise and structured way compared to free text, which would make them useful input to the reading system we use. In addition, we hope to make use of the references section produced by Science Parse to pull in relevant information that isn't included in the current paper, and to develop a more sophisticated way of using section headings (e.g., to focus on certain, more informative sections).

As the PDF-to-text conversion process is always noisy, before extracting events, excessively noisy text (e.g., poorly converted tables) is filtered out using common sense heuristics (e.g., sentence length).

To perform this step, you need to have Science Parse running. Below are two options for getting Science Parse running.

Docker-compose

  • cd to the directory <automates_root>/text_reading/docker
  • Run the command docker-compose up scienceparse

Using Science Parse git repo

Alternatively, clone the Science Parse git repo and run the science-parse-server-assembly jar with the following command:

java -Xmx8g -jar science-parse-server-assembly-2.0.3.jar

With Science Parse running, the conversion can then be done in bulk by running:

sbt 'runMain org.clulab.aske.automates.apps.ParsePdfsToJson <directory_containing_pdfs> <output_directory>'

Markdown

Markdown files do not require any additional preprocessing and can be run as is.

Plain text

Plain text does not require any additional preprocessing.

Scala Apps

To run the apps, run sbt run from the automates/automates/text_reading directory and, when prompted, choose the number associated with the app you want to run. For instance, to run the ExtractAndExport app/script when presented with this prompt, enter 6:

Multiple main classes detected, select one to run:

 [1] org.clulab.aske.automates.apps.AlignmentBaseline
 [2] org.clulab.aske.automates.apps.AutomatesShell
 [3] org.clulab.aske.automates.apps.ExportDoc
 [4] org.clulab.aske.automates.apps.ExtractAndAlign
 [5] org.clulab.aske.automates.apps.ExtractAndAssembleMentionEvents
 [6] org.clulab.aske.automates.apps.ExtractAndExport
 [7] org.clulab.aske.automates.apps.ParsePdfsToJson
 [8] org.clulab.aske.automates.entities.TestStuff

With most input files, the apps will be able to choose the correct loader based on the file extension; however, since Science Parse jsons are not the default json type that we accept, to process Science Parse jsons, set the dataLoader within the script that you want to run to ScienceParsedDataLoader:

val dataLoader = new ScienceParsedDataLoader

and remove the existing dataLoader assignment (this is true for both main mention extraction apps described further; in ExtractAndExport, this is done here).

There are two maintained scala scripts that you can use for mention extraction: ExtractAndExport and ExtractAndAssembleMentionEvents. Both scripts take as input the directory with the files to process (defined in application.conf) and store outputs in the output directory (also defined in application.conf). The ExtractAndExport script will, given a directory of input files, extract mentions and output them in several formats to the output directory. The output formats (exportAs here) are tsv (most human-readable), json (with mentions exported as json objects that can be loaded back using the AutomatesJSONSerializer), and serialized (we do not do anything with these, but that is functionality provided by processors, which the pipeline uses).

The second maintained script---ExtractAndAssembleMentionEvents---is used to extract mentions from multiple files associated with one model. The files can be scientific publications (jsons) and markdown files that document model code (mds). The script outputs a json file with mention extractions grouped by mention type. To produce a json file with fields suitable for annotation, set includeAnnotationField to true.

There is also a script for alignment of source code and scientific publications, but it is not currently maintained; a better way to achieve the same result is to run the align_experiment.py script described below. The other apps in the same directory are either legacy code or are described in other sections of the wiki.

Webapp

The webapp provides the most colorful, graphical, and perhaps easiest to understand output. It can be started up directly from the command line in one fell swoop:

> sbt webapp/run

To override the default project directory, use the -Dapps.projectDir flag:

> sbt -Dapps.projectDir=/local/path/to/automates webapp/run

sbt may take several minutes to bring up the application, especially the first time, as various dependencies are downloaded and the reader itself is compiled. Numerous logging messages should keep you posted on the progress.

After starting the webapp, use a web browser to navigate to localhost:9000. There you should see something like this:

webapp image

You can now submit texts for the AutoMATES reader to process. Make sure to check the Show everything box for proper display of extractions. Note: the webapp is not configured to display relation mentions with more than two arguments.

Please note that the very first submission will require extra time as lazily loaded parts of the system are initialized, but subsequent texts will be processed much more quickly.

To eventually escape from sbt, you can stop the web server with Control-D and then quit the program with exit.

Web Service

We have a number of endpoints defined in HomeController. Those can be used to convert PDFs to jsons, extract mentions given pdfs and different json formats, ground mentions to WikiData, and more. When the webapp is run, it exposes a web service at port 9000 via these endpoints. Every endpoint accepts a POST request with a JSON payload.

For example, the /process_text endpoint, which extracts mentions from a given text string, takes a JSON with the following parameter:

  • text: the text you wish to submit for parsing by the reader.

Querying endpoints

We can use the Python requests library to interact with the web service as follows (using /process_text as an example):

import requests

text = """where LAI is the simulated leaf area index"""

webservice = 'http://localhost:9000'
res = requests.post('%s/process_text' %webservice, headers={'Content-type': 'application/json'}, json={'text': text})

json_dict = res.json()

We have a python script for querying endpoints. Unlike the scala apps, which take directories of files as input, this script takes a single file as input and outputs a file. Before running the script, modify this line so that it calls the method you want to run out of those defined within the script. You can run the script this way:

python extract_align_ground.py <input_file_path> <output_file_path>

We can do the same with curl (using /process_text as an example):

> curl \
  --header "Content-type: application/json" \
  --request POST \
  --data '{"text": "where LAI is the simulated leaf area index"}' \
  http://localhost:9000/process_text

Reading with predefined entities

Further, if you have a set of predefined entities (e.g., a gazetteer), you can pass them to the webservice as a list of strings and the reader will incorporate them.

> curl \
  --header "Content-Type: application/json"  \
  --request POST \
  --data '{"text":"where rain is the average rainfall in the valley", "entities":["rain"]}' \
  http://localhost:9000/process_text

Variables found in this way will have the label VariableGazetteer.
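The same request can also be made from Python, mirroring the requests example above:

import requests

webservice = 'http://localhost:9000'
payload = {'text': 'where rain is the average rainfall in the valley', 'entities': ['rain']}
res = requests.post('%s/process_text' % webservice, headers={'Content-type': 'application/json'}, json=payload)

json_dict = res.json()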

Grounding to wikidata

We can ground descriptions extracted from text to Wikidata by querying their SPARQL Query Service. To ground previously extracted mentions (extracted either through the ExtractAndExport app or one of the web service endpoints, /json_doc_to_mentions or /cosmos_json_to_mentions, both runnable with the extract_align_ground.py script), run the call_groundMentionsToWikidata method using the extract_align_ground.py script.

If there are too many mentions to ground, the query service might not be able to handle the load. In that case, run the endpoint several times to make sure everything is grounded. There is a rudimentary cache that saves intermediate results to a file and allows all the queries to eventually be sent to Wikidata.
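For reference, the grounding relies on queries against the public Wikidata SPARQL endpoint of roughly this kind. This is a hedged, standalone illustration (not the query the pipeline actually sends), looking up an exact English label:

import requests

query = """
SELECT ?item ?itemLabel ?itemDescription WHERE {
  ?item rdfs:label "basic reproduction number"@en .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""
res = requests.get('https://query.wikidata.org/sparql',
                   params={'query': query, 'format': 'json'},
                   headers={'User-Agent': 'automates-grounding-example'})
for row in res.json()['results']['bindings']:
    print(row['item']['value'], row['itemLabel']['value'])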

Alignment Pipeline

A major component that can be run using the web service is the aligner of various files associated with a model.

Figure 1: Linked elements overview

To contextualize the models lifted from source code, we have implemented a machine reading and alignment system that extracts model information from two sources: (a) the scientific papers and technical documents (hereafter, document PDFs) that describe a model of interest, from which we extract the variables, the definitions for those variables, and the parameter settings for the variables; and (b) the comments from the Fortran source code, from which we can extract the variables, variable definitions, and potentially the units. The system then aligns the information obtained from these two sources with the source code variables, as well as with equation variables extracted from the scientific papers using our equation reading system.

With the alignment pipeline, we align the following elements:

  • text_var (source: document PDF): a variable extracted from free text; processing: text reading, rule-based extraction
  • text_span (source: document PDF): the description associated with the text variable; processing: text reading, rule-based extraction
  • equation_span (source: document PDF): the LaTeX representation of the variable extracted from a document PDF formula; processing: equation reading
  • comment_span (source: source code): the description associated with the source comment variable; processing: text reading, rule-based extraction
  • identifier (source: source code): the source code variable; processing: program analysis
  • svo_grounding* (source: Scientific Variable Ontology): the Scientific Variable Ontology concept associated with the variable; processing: SPARQL endpoint

  • Note: the SVO grounder is not currently used, as the maintaining organization has disabled the API endpoint that we were querying

Input files:

  1. a grfn file (@Clayton todo: how is this produced)
  2. comments file: Source comments come as plain text, so our text reading system can be applied to them directly after minor pre-processing (e.g., trimming each comment line). The system is readily available for the DSSAT-style comment blocks, but will need to be modified to read other types of comments. The comments need to be passed as a .json file (a minimal construction sketch follows this list), e.g.:
  "sir-simple$file_head": [
    "********************************************************************************\n",
    "!     Input Variables:\n",
    "!     S        Amount of susceptible members at the current timestep\n",
    "!     I        Amount of infected members at the current timestep\n",
    "!     R        Amount of recovered members at the current timestep\n",
    "!     beta     Rate of transmission via contact\n",
    "!     gamma    Rate of recovery from infection\n",
    "!     dt       Next inter-event time\n",
    "!\n",
    "!     State Variables:\n",
    "!     infected    Increase in infected at the current timestep\n",
    "!     recovered   Increase in recovered at the current timestep\n",
    "********************************************************************************\n"
  ],
  "sir-simple$file_foot": [],
  "@container::SIR-simple::@global::sir": {
    "head": [
      "********************************************************************************\n",
      "!     Input Variables:\n",
      "!     S        Amount of susceptible members at the current timestep\n",
      "!     I        Amount of infected members at the current timestep\n",
      "!     R        Amount of recovered members at the current timestep\n",
      "!     beta     Rate of transmission via contact\n",
      "!     gamma    Rate of recovery from infection\n",
      "!     dt       Next inter-event time\n",
      "!\n",
      "!     State Variables:\n",
      "!     infected    Increase in infected at the current timestep\n",
      "!     recovered   Increase in recovered at the current timestep\n",
      "********************************************************************************\n"
    ],
    "neck": [],
    "foot": [],
    "internal": {}
  }
}
  3. scientific publications in the json format (either Cosmos or Science Parse jsons; see the section on PDF-to-text conversion above)
  4. a latex equations .txt file, e.g.:
\frac { d S } { d t } = - \frac { \beta I S } { N }
\frac { d I } { d t } = \frac { \beta I S } { N } - \gamma I
\frac { d R } { d t } = \gamma I
"wikiGroundings": [{
    "variable": "R0",
    "groundings": [{
      "searchTerm": "number",
      "conceptID": "http://www.wikidata.org/entity/Q11563",
      "conceptLabel": "number",
      "conceptDescription": ["mathematical object used to count, label, and measure"],
      "alternativeLabel": ["number concept"],
      "subClassOf": ["http://www.wikidata.org/entity/Q246672,http://www.wikidata.org/entity/Q309314"],
      "score": [3.266666666666667],
      "source": "wikidata"
    }, {
      "searchTerm": "basic reproduction number",
      "conceptID": "http://www.wikidata.org/entity/Q901464",
      "conceptLabel": "basic reproduction number",
      "conceptDescription": ["metric in epidemiology showing average measure of a pathogen\u2019s infectiousness"],
      "alternativeLabel": ["R0, basic reproductive number, basic reproductive rate, basic reproductive ratio, R nought, R zero, R\u2080, Rnaught"],
      "subClassOf": ["http://www.wikidata.org/entity/Q97204337"],
      "score": [3.2596153846153846],
      "source": "wikidata"
    }, {
      "searchTerm": "reproduction number",
      "conceptID": "http://www.wikidata.org/entity/Q901464",
      "conceptLabel": "basic reproduction number",
      "conceptDescription": ["metric in epidemiology showing average measure of a pathogen\u2019s infectiousness"],
      "alternativeLabel": ["R0, basic reproductive number, basic reproductive rate, basic reproductive ratio, R nought, R zero, R\u2080, Rnaught"],
      "subClassOf": ["http://www.wikidata.org/entity/Q97204337"],
      "score": [2.9907692307692306],
      "source": "wikidata"
    }]
  },
  {...}
  ]
}
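As referenced in item 2 above, here is a minimal hedged sketch of assembling a comments .json of the shape shown there from a Fortran source file. In the real pipeline these blocks are produced alongside program analysis; the '!'/'*' comment prefixes and the file names are assumptions used only for illustration.

import json

def comment_lines(fortran_path):
    """Collect comment lines ('!'- or '*'-prefixed) from a Fortran source file."""
    with open(fortran_path) as f:
        return [line for line in f if line.lstrip().startswith(('!', '*'))]

comments = {
    'sir-simple$file_head': comment_lines('sir-simple.f'),    # placeholder source file
    'sir-simple$file_foot': [],
}

with open('sir-simple_comments.json', 'w') as out:
    json.dump(comments, out, indent=2)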

Align information from all the sources

Overview

Currently, we produce the following alignments (see Figure 1):

  • Document equation (equation_span) -> document text span (text_span) (Link type 1):

    We align document text spans to the variables in the latex representation of the document equation, produced using a computer vision-based equation reading system. Once we have the predicted latex equation, we chunk it using heuristics to group tokens that are part of the same variable. Every chunk, which is a potential variable, is normalized by removing LaTeX control sequences (e.g., E_\text{PEN} -> EPEN) and replacing spelled-out Greek letters with unicode ('alpha' -> α). We then align the normalized chunk to extracted text variables through string edit distance (see the sketch at the end of this overview). Since we extracted the text variables in the context of their definitions, this essentially produces an alignment between the equation chunk and a text definition (i.e., text_span), e.g.:

    E_\text{PEN} (latex chunk) -> EPEN (normalized chunk) -> EPEN (text_var) -> results of the Penman model (text_span)

  • Source code identifier (identifier) -> source code comment text span (comment_span) (Link type 2):

    Since source code identifiers and source code comments come from the same code base, they match exactly (case-insensitive). This means we can align source code identifiers to source comment text spans by finding a comment definition mention whose variable argument matches the identifier exactly.

  • Source code comment text span (comment_span) -> document text span (text_span) (Link type 3):

    The alignment is based on comparing word embeddings of the words in the definitions extracted from free text and from source code comments. Formally, each of the embeddings for the words in the description of a source code variable (identifier) is compared with each of the embeddings for the words in the definition of a text variable (text_var) using cosine similarity. The alignment score between the variables is the sum of (a) the average and (b) the maximum vector similarities, yielding a score that ranges from -2.0 (least similar) to 2.0 (most similar). The intuition is that the alignment reflects the overall similarity as well as the single best semantic match (see the sketch after this list).

  • Document in-text variable (text_var) -> document text span (text_span) (Link type 4):

    We produce these two elements jointly as arguments from a definition mention, which means no additional alignment step is required.

  • Document in-text variable (text_var) -> SVO concept (svo_grounding) (Link type 5):

    We query the SVO ontology for each text mention, so no additional alignment step is required.
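For Link type 3, here is a hedged sketch of the average-plus-maximum cosine similarity score described above, using toy vectors (the actual system uses pretrained word embeddings for the words in each description):

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def alignment_score(comment_vecs, text_vecs):
    """Sum of the average and the maximum pairwise cosine similarity, in [-2.0, 2.0]."""
    sims = [cosine(u, v) for u in comment_vecs for v in text_vecs]
    return sum(sims) / len(sims) + max(sims)

# Toy 3-d "embeddings" for the words of a comment description and a text definition.
comment_vecs = [np.array([0.1, 0.9, 0.0]), np.array([0.7, 0.2, 0.1])]
text_vecs = [np.array([0.1, 0.8, 0.1]), np.array([0.0, 0.1, 0.9])]
print(alignment_score(comment_vecs, text_vecs))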

With the exception of the comment_span/text_span alignment, which combines string edit distance with embedding-based similarity and is not currently normalized, elements are aligned using Levenshtein (string edit) distance, with scores normalized to range between 0 (no match) and 1 (perfect match). Either all or only some of the elements can be aligned.
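A hedged sketch of the equation-chunk normalization and edit-distance matching described above (the heuristics are simplified and the Greek-letter map is truncated; this is not the pipeline's actual code):

import re

GREEK = {'alpha': 'α', 'beta': 'β', 'gamma': 'γ'}    # truncated map

def normalize_chunk(chunk):
    """Drop LaTeX control sequences, braces, and sub/superscript markers; map spelled-out Greek letters."""
    chunk = re.sub(r'\\[A-Za-z]+', '', chunk)
    chunk = re.sub(r'[{}_^\s]', '', chunk)
    for name, letter in GREEK.items():
        chunk = chunk.replace(name, letter)
    return chunk

def edit_distance(a, b):
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

text_vars = ['EPEN', 'LAI']
chunk = normalize_chunk(r'E_\text{PEN}')                        # -> 'EPEN'
print(min(text_vars, key=lambda v: edit_distance(chunk, v)))    # -> 'EPEN'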

Usage

The alignment pipeline can be run via the /align endpoint. There is a script that will create a properly formatted POST request for the endpoint, run it, and produce an output file. Run the script as a regular python file with the inputs listed above as arguments (in the order in which they are listed in the input files section above); the wikidata file needs to be passed preceded by the --wikidata_file flag. The defaults for the toggles are set and can be modified here.
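For example, using placeholder file names (a hedged illustration of calling the align_experiment.py script mentioned earlier; check the script itself for the exact argument handling):

python align_experiment.py <grfn_file.json> <comments_file.json> <publication.json> <equations.txt> --wikidata_file <wikidata_groundings.json>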