Equation Detection and Decoding

Models are often represented concisely as equations, at a level of abstraction that can supplement both the natural language description and the source code implementation. Accordingly, here we describe the AutoMATES module for automatically reading equations found in scientific papers. This section details the approaches for (a) data acquisition, (b) detecting the location of equations, (c) encoding the image of an equation and then decoding it into a formal representation, and (d) converting the formal representation into an executable form that can be used in Program Analysis. Here we discuss current progress as well as planned next steps. To make rapid progress, the team has extensively explored available state-of-the-art (SOA) open-source tools and resources, so we additionally discuss the limitations of these tools and our plans for addressing them.

All code for the Equation Reading pipeline is implemented within the AutoMATES equation_extraction repository directory. Links to the READMEs for the individual components are provided in the section below.

Data collection

We constructed several datasets in order to train and evaluate the neural machine learning components used in the detection and decoding of equations found in text. For this work, the team is making use of papers written in LaTeX (TeX), downloaded in bulk from arXiv, an open-access preprint database for scientific publications. The team has downloaded the complete set of arXiv PDFs and their corresponding source files from Amazon S3 (as described here). Similar datasets have been constructed previously, but they are very limited in scope. For example, a sample of source files from the hep-th (theoretical high-energy physics) section of arXiv was collected in 2003 for the KDD cup competition (see equation decoding section for examples of the consequence of this limited scope). By downloading the full set of arXiv, the team has extended this dataset to both increase the number of training examples and to include a variety of AutoMATES-relevant domains (including agriculture, biology, and computer science).
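
As a concrete illustration of the bulk download step, the sketch below fetches a single source archive from arXiv's requester-pays S3 bucket using boto3. The bucket name (arxiv) and src/ prefix follow arXiv's bulk-data documentation; the specific archive key is a placeholder, and this is not the exact script used by the pipeline.

```python
import boto3

# arXiv bulk data is hosted in a requester-pays S3 bucket, so the
# downloader's AWS account is billed for the transfer.
s3 = boto3.client("s3")

# Placeholder key; the bucket contains many such tar archives under src/.
key = "arXiv_src_2001_001.tar"

s3.download_file(
    Bucket="arxiv",
    Key="src/" + key,
    Filename=key,
    ExtraArgs={"RequestPayer": "requester"},  # required for requester-pays buckets
)
```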

Dataset preprocessing pipeline

The team has put together a preprocessing pipeline to prepare the downloaded arXiv data for use in equation detection and decoding.

First, the paper source files are organized into a directory structure that can be processed efficiently. Then, for each paper, the TeX file that contains the \documentclass directive is selected as the main TeX file (see here for more information). Once a main TeX file has been selected, the TeX source is tokenized using plasTeX, and the content of certain environments is collected along with the name of the environment itself (e.g., the tokens between the \begin{equation} and \end{equation} directives, together with the label equation). User-defined macros are expanded using a recursively applied lookup table to normalize the input to the neural decoder.
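
The sketch below illustrates the main-file selection and equation extraction in simplified form. It uses a regular expression in place of the plasTeX tokenization and skips macro expansion; the function names (find_main_tex_file, extract_equations) are illustrative rather than the pipeline's actual API.

```python
import re
from pathlib import Path

# Matches the body of each \begin{equation} ... \end{equation} environment.
EQUATION_RE = re.compile(r"\\begin\{equation\}(.*?)\\end\{equation\}", re.DOTALL)

def find_main_tex_file(paper_dir):
    r"""Return the first .tex file that contains a \documentclass directive."""
    for tex_path in Path(paper_dir).glob("**/*.tex"):
        if r"\documentclass" in tex_path.read_text(errors="ignore"):
            return tex_path
    return None

def extract_equations(tex_source):
    """Return the raw TeX content of every equation environment."""
    return [m.group(1).strip() for m in EQUATION_RE.finditer(tex_source)]
```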

Based on an analysis of 1600 arXiv papers, the most commonly used math environments (in order) are: equation, align, and \[ \]. While the Prototype currently only handles the equation environment (40% of the equations found), the pipeline will be extended to accommodate the other two types in the future.

The extracted code for each equation is rendered into a standalone equation image. The paired standalone image and source tokens form the training data for the equation decoder. Additionally, the PDF file for the entire paper is scanned for the standalone equation image using template matching. The resulting axis-aligned bounding box (AABB) is stored for the subsequent training of an equation detector. The team also incorporated template rescaling to better match the rendered equation against the original PDF image. This resulted in significantly more accurate axis-aligned bounding boxes.
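
A minimal sketch of the multi-scale template matching step using OpenCV is shown below. The function name, scale range, and matching metric are assumptions for illustration, not necessarily the values used in the repository.

```python
import cv2
import numpy as np

def locate_equation(page_img, eq_img, scales=np.linspace(0.8, 1.2, 9)):
    """Find the best axis-aligned bounding box for a rendered equation on a
    rasterized PDF page, trying several rescaled versions of the template."""
    page_gray = cv2.cvtColor(page_img, cv2.COLOR_BGR2GRAY)
    eq_gray = cv2.cvtColor(eq_img, cv2.COLOR_BGR2GRAY)

    best = None
    for scale in scales:
        template = cv2.resize(eq_gray, None, fx=scale, fy=scale)
        th, tw = template.shape
        if th > page_gray.shape[0] or tw > page_gray.shape[1]:
            continue
        scores = cv2.matchTemplate(page_gray, template, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(scores)
        if best is None or max_val > best[0]:
            best = (max_val, (max_loc[0], max_loc[1], tw, th))  # (x, y, w, h)
    return best  # (score, AABB), or None if no scale fit the page
```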

Equation detection

Before equations can be decoded, they first need to be located within the scientific papers encoded as PDF files. For this, the team evaluated standard machine vision techniques. The SOA Mask R-CNN (He et al., 2017) was selected both for its robust performance across several detection tasks and for its ease of use. Here, as the desired output of the model is the page and AABB of each detected equation, we ignore the mask (i.e., the precise set of pixels which compose the object), and as such the model is essentially an easy-to-use Faster R-CNN (Ren et al., 2015).

The Faster R-CNN model uses a base network consisting of a series of convolutional and pooling layers as a feature extractor for subsequent steps. This network is typically a ResNet backbone pretrained on ImageNet or COCO.

Next, a region proposal network (RPN) uses the features extracted in the previous step to propose a predefined number of bounding boxes that may contain equations. For this purpose, fixed bounding boxes of different sizes (anchors) are placed throughout the image. For each anchor, the RPN predicts two values: the probability that it contains an object of interest, and a correction to the bounding box to make it fit the object more tightly.

At this point, the Faster R-CNN uses a second step to classify the type of object, using a traditional R-CNN. Since there is only one type of object of interest (equations), the output of the RPN can be used directly, simplifying training and speeding up inference. However, one potential disadvantage of only having a single label is that the model could be confused by similar page components (e.g., section titles and tables).
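
As a sketch of how such a single-class detector can be set up, the snippet below configures torchvision's Faster R-CNN with a COCO-pretrained ResNet-50 FPN backbone and a two-class head (background plus equation). This illustrates the general approach rather than the exact configuration used in the repository.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Faster R-CNN with a ResNet-50 FPN backbone pretrained on COCO.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Replace the box-classification head so it predicts only two classes:
# background and "equation".
num_classes = 2
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```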


Equation decoding

Once detected, the rendered equations need to be automatically converted into LaTeX code. For this task we employ a variant of an encoder-decoder architecture that encodes the equation image into a dense embedding and then decodes it into LaTeX code capable of being compiled back into an image. LaTeX was selected as the intermediate representation between the input image and the eventual target executable model of the equation because a large amount of training data is available (arXiv), and because LaTeX preserves both the typographic information about how equations are rendered (e.g., bolding, italics, subscripts) and the components of the notation needed for the successful interpretation of the equation semantics.

Encoder-decoder architectures have been successfully applied to image caption generation (e.g., Vinyals et al., 2017), a task similar to our task of mapping equation images to equation code. In order to make rapid progress, we began with an existing SOA model previously trained to convert images to markup (Deng et al., 2017), implemented in OpenNMT. The model was trained on the 2003 KDD Cup dataset, which itself consists of a subset of arXiv physics papers.

We found that the pre-trained Deng et al. model did not generalize well to our use case, and further that the primary source of the problem was the image formatting rather than the domain. To address this, we (a) retrained the model with additional data (approximately 1M equations from the arXiv data described above) and (b) performed data augmentation on the rendered equation images to improve the robustness of the model. For data augmentation, we randomly augmented the arXiv images with different types and amounts of downsampling.
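
An illustrative sketch of one such augmentation is shown below: an equation image is downsampled by a random factor and resized back to its original dimensions, simulating a low-resolution rendering. The factor range and resampling filter are assumptions, not the exact settings used during training.

```python
import random
from PIL import Image

def random_downsample(img, min_factor=0.4, max_factor=1.0):
    """Downsample an equation image by a random factor, then resize it back
    to its original size to simulate a lower-quality rendering."""
    factor = random.uniform(min_factor, max_factor)
    w, h = img.size
    small = img.resize((max(1, int(w * factor)), max(1, int(h * factor))), Image.BILINEAR)
    return small.resize((w, h), Image.BILINEAR)
```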

Model training and inference are done using the steps described here.

Conversion to executable representation

The final stage in the pipeline is the conversion of the equation to an executable representation. We chose to use SymPy for two reasons. First, and primarily, SymPy provides a symbolic representation of the equation, so that while it is executable, variables can remain unassigned. Second, the Program and Model Analysis modules use Python as the intermediate language.
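
A minimal sketch of this conversion, assuming the decoded LaTeX parses cleanly, is shown below. It uses SymPy's LaTeX parser (which requires the antlr4 runtime) to obtain a symbolic expression and lambdify to make it executable; the example equation and variable names are purely illustrative.

```python
from sympy import lambdify, symbols
from sympy.parsing.latex import parse_latex  # requires the antlr4-python3-runtime package

# Parse decoded LaTeX into a symbolic SymPy expression; variables stay unassigned.
expr = parse_latex(r"\frac{a + b}{2}")  # -> (a + b)/2

# Bind the free variables to produce an executable function when needed.
a, b = symbols("a b")
mean = lambdify((a, b), expr)
print(mean(3, 5))  # 4.0
```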

Instructions for running components

We have separate README files for the individual components of the equation reading pipeline: