diff --git a/CHANGELOG.md b/CHANGELOG.md index 45a0ce6b9a..fbc40a2013 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -13,6 +13,32 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/). ### Fixed +## [0.7.0] – 2021-07- + +### Added + ++ New YAML configuration: all the settings are in a single YAML file and each model can be fully configured independently ++ Improved segmentation and header models (for the header, +1 F1-score on the PMC evaluation, +4 F1-score on bioRxiv), with further improvements for body and citations ++ Add figure and table pop-up visualization on PDF in the console demo ++ Add PDF MD5 digest in the TEI results (service only) ++ Language support packages and xpdfrc file for pdfalto (support of CJK and exotic fonts) ++ Prometheus metrics ++ BidLSTM-CRF-FEATURES implementation available for more models ++ Addition of a "How GROBID works" page in the documentation + +### Changed + ++ JitPack release (RIP jcenter) ++ Improved DOI cleaning ++ Speed improvement (around +10%) by factorizing some layout token manipulations ++ Update CrossRef requests implementation to align with the current usage of CrossRef's `X-Rate-Limit-Limit` response parameter + +### Fixed + ++ Fix base URL in demo console ++ Add missing pdfalto Graphics information when `-noImage` is used, fix graphics data path in TEI ++ Fix the tendency to merge tables when they are in close proximity + ## [0.6.2] – 2020-03-20 ### Added diff --git a/Dockerfile.delft b/Dockerfile.delft index ce34c05bf7..f4ffc26b46 100644 --- a/Dockerfile.delft +++ b/Dockerfile.delft @@ -2,14 +2,14 @@ ## See https://grobid.readthedocs.io/en/latest/Grobid-docker/ -## usage example with version 0.6.2-SNAPSHOT: -## docker build -t grobid/grobid:0.6.2-SNAPSHOT --build-arg GROBID_VERSION=0.6.2-SNAPSHOT --file Dockerfile.delft . +## usage example with version 0.7.1-SNAPSHOT: +## docker build -t grobid/grobid:0.7.1-SNAPSHOT --build-arg GROBID_VERSION=0.7.1-SNAPSHOT --file Dockerfile.delft . ## no GPU: -## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.6.2-SNAPSHOT +## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.1-SNAPSHOT ## allocate all available GPUs (only Linux with proper nvidia driver installed on host machine): -## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/obid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.6.2-SNAPSHOT +## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.1-SNAPSHOT # ------------------- # build builder image diff --git a/Readme.md b/Readme.md index 25c7220afe..a7a4ff19d1 100644 --- a/Readme.md +++ b/Readme.md @@ -24,28 +24,26 @@ GROBID is a machine learning library for extracting, parsing and re-structuring The following functionalities are available: - __Header extraction and parsing__ from articles in PDF format. The extraction here covers the usual bibliographical information (e.g. title, abstract, authors, affiliations, keywords, etc.). -- __References extraction and parsing__ from articles in PDF format, around .87 f-score against on an independent PubMed Central set of 1943 PDF containing 90,125 references.
All the usual publication metadata are covered (including DOI, PMID, etc.). -- __Citation contexts recognition and resolution__ to the full bibliographical references of the article. The accuracy of citation contexts resolution is above .76 f-score (which corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference). -- Parsing of __references in isolation__ (around .90 f-score at instance-level, .95 f-score at field level). +- __References extraction and parsing__ from articles in PDF format, around .87 F1-score on an independent PubMed Central set of 1943 PDFs containing 90,125 references, and around .89 on a similar bioRxiv set. All the usual publication metadata are covered (including DOI, PMID, etc.). +- __Citation contexts recognition and resolution__ of the full bibliographical references of the article. The accuracy of citation contexts resolution is above .78 F1-score (which corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference). +- Parsing of __references in isolation__ (above .90 F1-score at instance-level, .95 F1-score at field level). - __Parsing of names__ (e.g. person title, forenames, middlename, etc.), in particular author names in header, and author names in references (two distinct models). - __Parsing of affiliation and address__ blocks. - __Parsing of dates__, ISO normalized day, month, year. -- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference callout, figure, table, etc.). -- __Consolidation/resolution of the extracted bibliographical references__ using the [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service or the [CrossRef REST API](https://github.com/CrossRef/rest-api-doc). In both cases, DOI resolution performance is higher than 0.95 f-score from PDF extraction. +- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference callout, figure, table, etc.). +- __Consolidation/resolution of the extracted bibliographical references__ using the [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service or the [CrossRef REST API](https://github.com/CrossRef/rest-api-doc). In both cases, DOI resolution performance is higher than 0.95 F1-score from PDF extraction. - __Extraction and parsing of patent and non-patent references in patent__ publications. - __PDF coordinates__ for extracted information, making it possible to create "augmented" interactive PDFs. -In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middlenames, affiliation types, detailed address, journal, volume, issue, pages, doi, pmid, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure headers, etc.). +In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middlenames, affiliation types, detailed address, journal, volume, issue, pages, doi, pmid, etc.)
to full text structures (section title, paragraph, reference markers, head/foot notes, figure captions, etc.). GROBID includes a comprehensive web service API, batch processing, a JAVA API, a Docker image, a generic evaluation framework (precision, recall, etc., n-fold cross-evaluation) and the semi-automatic generation of training data. -GROBID can be considered as production ready. Deployments in production includes ResearchGate, HAL Research Archive, INIST-CNRS, CERN (Invenio), scite.ai, and many more. The tool is designed for high scalability in order to address the full scientific literature corpus. +GROBID can be considered production-ready. Deployments in production include ResearchGate, Internet Archive Scholar, HAL Research Archive, INIST-CNRS, CERN (Invenio), scite.ai, Academia.edu and many more. The tool is designed for speed and high scalability in order to address the full scientific literature corpus. GROBID should run properly "out of the box" on Linux (64 bits) and macOS. We currently cannot ensure support for Windows as we did before (help welcome!). -GROBID uses optionnally Deep Learning models relying on the [DeLFT](https://github.com/kermitt2/delft) library, a task-agnostic Deep Learning framework for sequence labelling and text classification. The tool can run with feature engineered CRF (default), Deep Learning architectures (with or without layout feature channels) or any mixtures of CRF and DL to balance scalability and accuracy. - -For more information on how the tool works, on its key features and [benchmarking](https://grobid.readthedocs.io/en/latest/Benchmarking/), visit the [GROBID documentation](https://grobid.readthedocs.org). +GROBID can optionally use Deep Learning models relying on the [DeLFT](https://github.com/kermitt2/delft) library, a task-agnostic Deep Learning framework for sequence labelling and text classification. The tool can run with feature-engineered CRF (default), Deep Learning architectures (with or without layout feature channels) or any mixture of CRF and DL to balance scalability and accuracy. These models use joint text and visual/layout information provided by [pdfalto](https://github.com/kermitt2/pdfalto). ## Demo @@ -57,7 +55,7 @@ _Warning_: Some quota and query limitation apply to the demo server! Please be c ## Clients -For helping to exploit GROBID service at scale, we provide clients written in Python, Java, node.js using the [web services](https://grobid.readthedocs.io/en/latest/Grobid-service/) for parallel batch processing: +To facilitate using the GROBID service at scale, we provide clients written in Python, Java and node.js that use the [web services](https://grobid.readthedocs.io/en/latest/Grobid-service/) for parallel batch processing: - Python GROBID client - Java GROBID client @@ -69,21 +67,35 @@ We have been able recently to run the complete fulltext processing at around 10. In addition, a Java example project is available to illustrate how to use GROBID as a Java library: [https://github.com/kermitt2/grobid-example](https://github.com/kermitt2/grobid-example). The example project uses the GROBID Java API to extract header metadata and citations from a PDF and output the results in BibTeX format.
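As a quick illustration of the web service API, here is a minimal sketch of a call to a local GROBID instance (assuming the service runs on the default port 8070; see the [service documentation](https://grobid.readthedocs.io/en/latest/Grobid-service/) for the full list of endpoints and parameters):

```bash
# send one PDF to a local GROBID instance and retrieve the TEI XML result;
# consolidateHeader=1 enables consolidation of the extracted header metadata
curl --form input=@./article.pdf \
     --form consolidateHeader=1 \
     http://localhost:8070/api/processFulltextDocument
```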
-Finally, the following python utilities can be used to create structured full text corpora of scientific articles simply by indicating a list of strong identifiers like DOI or PMID, performing the identification of online Open Access PDF, the harvesting, the metadata agreegation and the Grobid processing in one step at scale: [article-dataset-builder](https://github.com/kermitt2/article-dataset-builder) +Finally, the following Python utility can be used to create structured full text corpora of scientific articles. The tool simply takes a list of strong identifiers like DOI or PMID and performs the identification of online Open Access PDFs, full text harvesting, metadata aggregation and GROBID processing in one workflow at scale: [article-dataset-builder](https://github.com/kermitt2/article-dataset-builder) + +## How GROBID works + +Visit the [documentation page describing the system](https://grobid.readthedocs.io/en/latest/Principles/). To summarize, the key design principles of GROBID are: + +- GROBID uses a [cascade of sequence labeling models](https://grobid.readthedocs.io/en/latest/Principles/#document-parsing-as-a-cascade-of-sequence-labeling-models) to parse a document. + +- The different models [do not work on text, but on **Layout Tokens**](https://grobid.readthedocs.io/en/latest/Principles/#layout-tokens-not-text) to exploit the various visual/layout information available for every token. + +- GROBID does not use training data derived from existing publisher XML documents, but [small, high quality sets](https://grobid.readthedocs.io/en/latest/Principles/#training-data-qualitat-statt-quantitat) of manually labeled training data. + +- Technical choices and [default settings](https://grobid.readthedocs.io/en/latest/Principles/#balancing-accuracy-and-scalability) are driven by the ability to process PDF quickly, with commodity hardware and with good parallelization and scalability capacities. + +Detailed end-to-end [benchmarking](https://grobid.readthedocs.io/en/latest/Benchmarking/) is available in the [GROBID documentation](https://grobid.readthedocs.org) and continuously updated.
## GROBID Modules A series of additional modules has been developed for performing __structure aware__ text mining directly on scholarly PDFs, reusing GROBID's PDF processing and sequence labelling weaponry: -- [grobid-ner](https://github.com/kermitt2/grobid-ner): named entity recognition -- [grobid-quantities](https://github.com/kermitt2/grobid-quantities): recognition and normalization of physical quantities/measurements - [software-mention](https://github.com/Impactstory/software-mentions): recognition of software mentions and attributes in scientific literature -- [grobid-astro](https://github.com/kermitt2/grobid-astro): recognition of astronomical entities in scientific papers -- [grobid-bio](https://github.com/kermitt2/grobid-bio): a bio-entity tagger using BioNLP/NLPBA 2004 dataset -- [grobid-dictionaries](https://github.com/MedKhem/grobid-dictionaries): structuring dictionaries in raw PDF format +- [grobid-quantities](https://github.com/kermitt2/grobid-quantities): recognition and normalization of physical quantities/measurements - [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors): recognition of superconductor material and properties in scientific literature - [entity-fishing](https://github.com/kermitt2/entity-fishing), a tool for extracting Wikidata entities from text and documents, can also use Grobid to pre-process scientific articles in PDF, leading to more precise and relevant entity extraction and the capacity to annotate the PDF with an interactive layout. - [dataseer-ml](https://github.com/dataseer/dataseer-ml): identification of sections and sentences introducing a dataset in a scientific article, and classification of the type of this dataset. +- [grobid-ner](https://github.com/kermitt2/grobid-ner): named entity recognition +- [grobid-astro](https://github.com/kermitt2/grobid-astro): recognition of astronomical entities in scientific papers +- [grobid-bio](https://github.com/kermitt2/grobid-bio): a bio-entity tagger using BioNLP/NLPBA 2004 dataset +- [grobid-dictionaries](https://github.com/MedKhem/grobid-dictionaries): structuring dictionaries in raw PDF format ## Release and changes @@ -121,5 +133,3 @@ If you want to cite this work, please refer to the present GitHub project, toget ``` See the [GROBID documentation](https://grobid.readthedocs.org/en/latest/References) for more related resources. - - diff --git a/doc/Benchmarking-biorxiv.md b/doc/Benchmarking-biorxiv.md index e8a1488dd6..67bd18ee3c 100644 --- a/doc/Benchmarking-biorxiv.md +++ b/doc/Benchmarking-biorxiv.md @@ -2,15 +2,341 @@ ## General -This is the end-to-end benchmarking result for GROBID version **0.6.2** against the `bioRxiv` test set (`biorxiv-10k-test-2000`), see the [End-to-end evaluation](End-to-end-evaluation.md) page for explanations and for reproducing this evaluation. +This is the end-to-end benchmarking result for GROBID version **0.7.0** against the `bioRxiv` test set (`biorxiv-10k-test-2000`), see the [End-to-end evaluation](End-to-end-evaluation.md) page for explanations and for reproducing this evaluation. + +The following end-to-end results are using: +- **BidLSTM-CRF-FEATURES** as sequence labeling for the citation model +- **CRF Wapiti** as sequence labelling engine for all other models. + +Header extractions are consolidated by default with [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service (the results with CrossRef REST API as consolidation service should be similar but much slower).
+ +Other versions of these benchmarks with variants and **Deep Learning models** (e.g. newer master snapshots) are available [here](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc). Note that Deep Learning models might provide higher accuracy, but at the cost of slower runtime and more expensive CPU/GPU resources. + +Evaluation on 1999 PDF preprints out of 2000 (1 PDF "too many blocks" interruption). + +Runtime for processing 2000 PDF: **1169s** (1,71 PDF per second) on Ubuntu 16.04, 4 CPU i7-4790K (8 threads), 16GB RAM (workstation bought in 2015 for 1600 euros) and with a GeForce GTX 1050 Ti GPU. + +## Header metadata + +Evaluation on 1999 random PDF files out of 2000 PDF (ratio 1.0). + +#### Strict Matching (exact matches) + +**Field-level results** + +| label | precision | recall | f1 | support | +|--- |--- |--- |--- |--- | +| abstract | 2.24 | 2.16 | 2.2 | 1989 | +| authors | 84.06 | 83.13 | 83.59 | 1998 | +| first_author | 94.69 | 93.74 | 94.21 | 1996 | +| keywords | 59.91 | 60.91 | 60.4 | 839 | +| title | 86.84 | 84.14 | 85.47 | 1999 | +| | | | | | +| **all fields (micro avg.)** | **66.58** | **65.39** | **65.98** | 8821 | +| all fields (macro avg.) | 65.54 | 64.82 | 65.17 | 8821 | + + + +#### Soft Matching (ignoring punctuation, case and space characters mismatches) + +**Field-level results** + +| label | precision | recall | f1 | support | +|--- |--- |--- |--- |--- | +| abstract | 57.42 | 55.46 | 56.42 | 1989 | +| authors | 84.62 | 83.68 | 84.15 | 1998 | +| first_author | 94.79 | 93.84 | 94.31 | 1996 | +| keywords | 65.77 | 66.87 | 66.31 | 839 | +| title | 92.36 | 89.49 | 90.9 | 1999 | +| | | | | | +| **all fields (micro avg.)** | **80.78** | **79.33** | **80.05** | 8821 | +| all fields (macro avg.) | 78.99 | 77.87 | 78.42 | 8821 | + + + +#### Levenshtein Matching (Minimum Levenshtein distance at 0.8) + +**Field-level results** + +| label | precision | recall | f1 | support | +|--- |--- |--- |--- |--- | +| abstract | 76.83 | 74.21 | 75.5 | 1989 | +| authors | 92.46 | 91.44 | 91.95 | 1998 | +| first_author | 95.19 | 94.24 | 94.71 | 1996 | +| keywords | 78.31 | 79.62 | 78.96 | 839 | +| title | 95.25 | 92.3 | 93.75 | 1999 | +| | | | | | +| **all fields (micro avg.)** | **88.85** | **87.26** | **88.05** | 8821 | +| all fields (macro avg.) | 87.61 | 86.36 | 86.97 | 8821 | + + + +#### Ratcliff/Obershelp Matching (Minimum Ratcliff/Obershelp similarity at 0.95) + +**Field-level results** + +| label | precision | recall | f1 | support | +|--- |--- |--- |--- |--- | +| abstract | 73.87 | 71.34 | 72.58 | 1989 | +| authors | 88.16 | 87.19 | 87.67 | 1998 | +| first_author | 94.69 | 93.74 | 94.21 | 1996 | +| keywords | 71.28 | 72.47 | 71.87 | 839 | +| title | 93.96 | 91.05 | 92.48 | 1999 | +| | | | | | +| **all fields (micro avg.)** | **86.11** | **84.57** | **85.34** | 8821 | +| all fields (macro avg.) | 84.39 | 83.16 | 83.76 | 8821 | + + +#### Instance-level results + +``` +Total expected instances: 1999 +Total correct instances: 34 (strict) +Total correct instances: 753 (soft) +Total correct instances: 1158 (Levenshtein) +Total correct instances: 1026 (ObservedRatcliffObershelp) + +Instance-level recall: 1.7 (strict) +Instance-level recall: 37.67 (soft) +Instance-level recall: 57.93 (Levenshtein) +Instance-level recall: 51.33 (RatcliffObershelp) +``` + + +## Citation metadata + +Evaluation on 1999 random PDF files out of 2000 PDF (ratio 1.0). 
+ +#### Strict Matching (exact matches) + +**Field-level results** + +| label | precision | recall | f1 | support | +|--- |--- |--- |--- |--- | +| authors | 86.67 | 78.42 | 82.34 | 97138 | +| date | 91.41 | 83.27 | 87.15 | 97585 | +| doi | 72.61 | 80.38 | 76.3 | 16893 | +| first_author | 93.54 | 84.57 | 88.83 | 97138 | +| inTitle | 81.53 | 77.13 | 79.27 | 96384 | +| issue | 93.61 | 85.63 | 89.44 | 30282 | +| page | 96.52 | 78.67 | 86.69 | 88558 | +| pmcid | 63.26 | 63.57 | 63.41 | 807 | +| pmid | 66.96 | 75.63 | 71.03 | 2093 | +| title | 84.29 | 80.7 | 82.45 | 92423 | +| volume | 95.57 | 92.79 | 94.16 | 87671 | +| | | | | | +| **all fields (micro avg.)** | **89.25** | **82.2** | **85.58** | 706972 | +| all fields (macro avg.) | 84.18 | 80.07 | 81.92 | 706972 | + + + +#### Soft Matching (ignoring punctuation, case and space characters mismatches) + +**Field-level results** + +| label | precision | recall | f1 | support | +|--- |--- |--- |--- |--- | +| authors | 87.94 | 79.56 | 83.54 | 97138 | +| date | 91.41 | 83.27 | 87.15 | 97585 | +| doi | 77.11 | 85.35 | 81.02 | 16893 | +| first_author | 94 | 84.99 | 89.27 | 97138 | +| inTitle | 91.18 | 86.26 | 88.65 | 96384 | +| issue | 93.61 | 85.63 | 89.44 | 30282 | +| page | 96.52 | 78.67 | 86.69 | 88558 | +| pmcid | 73.74 | 74.1 | 73.92 | 807 | +| pmid | 71.36 | 80.6 | 75.7 | 2093 | +| title | 92.48 | 88.55 | 90.47 | 92423 | +| volume | 95.57 | 92.79 | 94.16 | 87671 | +| | | | | | +| **all fields (micro avg.)** | **92.1** | **84.83** | **88.32** | 706972 | +| all fields (macro avg.) | 87.72 | 83.62 | 85.46 | 706972 | + + + +#### Levenshtein Matching (Minimum Levenshtein distance at 0.8) + +**Field-level results** + +| label | precision | recall | f1 | support | +|--- |--- |--- |--- |--- | +| authors | 92.69 | 83.86 | 88.06 | 97138 | +| date | 91.41 | 83.27 | 87.15 | 97585 | +| doi | 79.82 | 88.36 | 83.87 | 16893 | +| first_author | 94.15 | 85.12 | 89.41 | 97138 | +| inTitle | 92.13 | 87.16 | 89.58 | 96384 | +| issue | 93.61 | 85.63 | 89.44 | 30282 | +| page | 96.52 | 78.67 | 86.69 | 88558 | +| pmcid | 73.74 | 74.1 | 73.92 | 807 | +| pmid | 71.4 | 80.65 | 75.75 | 2093 | +| title | 95.33 | 91.27 | 93.26 | 92423 | +| volume | 95.57 | 92.79 | 94.16 | 87671 | +| | | | | | +| **all fields (micro avg.)** | **93.36** | **85.99** | **89.53** | 706972 | +| all fields (macro avg.) | 88.76 | 84.63 | 86.48 | 706972 | + + + +#### Ratcliff/Obershelp Matching (Minimum Ratcliff/Obershelp similarity at 0.95) + +**Field-level results** + +| label | precision | recall | f1 | support | +|--- |--- |--- |--- |--- | +| authors | 89.79 | 81.23 | 85.3 | 97138 | +| date | 91.41 | 83.27 | 87.15 | 97585 | +| doi | 79.05 | 87.5 | 83.06 | 16893 | +| first_author | 93.59 | 84.61 | 88.87 | 97138 | +| inTitle | 89.87 | 85.03 | 87.38 | 96384 | +| issue | 93.61 | 85.63 | 89.44 | 30282 | +| page | 96.52 | 78.67 | 86.69 | 88558 | +| pmcid | 63.26 | 63.57 | 63.41 | 807 | +| pmid | 66.96 | 75.63 | 71.03 | 2093 | +| title | 94.52 | 90.5 | 92.47 | 92423 | +| volume | 95.57 | 92.79 | 94.16 | 87671 | +| | | | | | +| **all fields (micro avg.)** | **92.42** | **85.12** | **88.62** | 706972 | +| all fields (macro avg.) 
| 86.74 | 82.59 | 84.45 | 706972 | + +#### Instance-level results + +``` +Total expected instances: 98753 +Total extracted instances: 103498 +Total correct instances: 41273 (strict) +Total correct instances: 51887 (soft) +Total correct instances: 56012 (Levenshtein) +Total correct instances: 52881 (RatcliffObershelp) + +Instance-level precision: 39.88 (strict) +Instance-level precision: 50.13 (soft) +Instance-level precision: 54.12 (Levenshtein) +Instance-level precision: 51.09 (RatcliffObershelp) + +Instance-level recall: 41.79 (strict) +Instance-level recall: 52.54 (soft) +Instance-level recall: 56.72 (Levenshtein) +Instance-level recall: 53.55 (RatcliffObershelp) + +Instance-level f-score: 40.81 (strict) +Instance-level f-score: 51.31 (soft) +Instance-level f-score: 55.39 (Levenshtein) +Instance-level f-score: 52.29 (RatcliffObershelp) + +Matching 1 : 75680 + +Matching 2 : 4230 + +Matching 3 : 6093 + +Matching 4 : 2284 + +Total matches : 88287 +``` + + +#### Citation context resolution +``` + +Total expected references: 98712 - 49.38 references per article +Total predicted references: 103442 - 51.75 references per article + +Total expected citation contexts: 142737 - 71.4 citation contexts per article +Total predicted citation contexts: 134945 - 67.51 citation contexts per article + +Total correct predicted citation contexts: 111261 - 55.66 citation contexts per article +Total wrong predicted citation contexts: 23684 (wrong callout matching, callout missing in NLM, or matching with a bib. ref. not aligned with a bib.ref. in NLM) + +Precision citation contexts: 82.45 +Recall citation contexts: 77.95 +fscore citation contexts: 80.14 +``` + + +## Fulltext structures + +Fulltext structure contents are complicated to capture from JATS NLM files. They are often normalized, differ from the actual PDF content, and can be inconsistent from one document to another. The scores of the following metrics are thus not very meaningful in absolute terms, in particular for the strict matching (the textual content of the structure can be very long). As relative values for comparing different models, they nevertheless seem useful. + + +Evaluation on 1999 random PDF files out of 2000 PDF (ratio 1.0). + +#### Strict Matching (exact matches) + +**Field-level results** + +| label | precision | recall | f1 | support | +|--- |--- |--- |--- |--- | +| figure_title | 4.13 | 3.57 | 3.83 | 13162 | +| reference_citation | 70.63 | 70.48 | 70.55 | 147404 | +| reference_figure | 73.72 | 65.91 | 69.6 | 47965 | +| reference_table | 48.12 | 80.66 | 60.28 | 5951 | +| section_title | 71.28 | 71.04 | 71.16 | 32384 | +| table_title | 4.54 | 4.09 | 4.3 | 2957 | +| | | | | | +| **all fields (micro avg.)** | **66.55** | **65.6** | **66.07** | 249823 | +| all fields (macro avg.) | 45.4 | 49.29 | 46.62 | 249823 | + + + +#### Soft Matching (ignoring punctuation, case and space characters mismatches) + +**Field-level results** + +| label | precision | recall | f1 | support | +|--- |--- |--- |--- |--- | +| figure_title | 67.36 | 58.3 | 62.51 | 13162 | +| reference_citation | 82.28 | 82.1 | 82.19 | 147404 | +| reference_figure | 74.42 | 66.53 | 70.25 | 47965 | +| reference_table | 48.55 | 81.38 | 60.81 | 5951 | +| section_title | 75.06 | 74.8 | 74.93 | 32384 | +| table_title | 50.73 | 45.72 | 48.1 | 2957 | +| | | | | | +| **all fields (micro avg.)** | **77.57** | **76.47** | **77.01** | 249823 | +| all fields (macro avg.)
| 66.4 | 68.14 | 66.47 | 249823 | + +Evaluation metrics produced in 1132.55 seconds + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + -The following end-to-end results are using **CRF Wapiti** only as sequence labelling engine. Header extractions are consolidated by default with [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service (the results with CrossRef REST API as consolidation service are similar but much slower). However, giving that preprints are processed, the consolidation usually fails because biblio-glutton uses an old snapshot of the CrossRef metadata (from end of 2019). -More recent versions of these benchmarks with variants and **Deep Learning models** (e.g. newer master snapshots) are available [here](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc). Deep Learning models provide higher accuracy at the cost of slower runtime and more expensive CPU resources. -Evaluation on 1998 PDF preprints out of 2000 (1 PDF parsing timeout and 1 PDF "too many blocks" interruption). -Runtime for processing 2000 PDF: **1321s** (1,51 PDF per second) on Ubuntu 16.04, 4 CPU i7-4790K (8 threads), 16GB RAM (workstation bought in 2015 for 1600 euros). diff --git a/doc/Benchmarking-pmc.md b/doc/Benchmarking-pmc.md index a2f493704d..c30da2a1b8 100644 --- a/doc/Benchmarking-pmc.md +++ b/doc/Benchmarking-pmc.md @@ -2,19 +2,23 @@ ## General -This is the end-to-end benchmarking result for GROBID version **0.6.2** against the `PMC_sample_1943` dataset, see the [End-to-end evaluation](End-to-end-evaluation.md) page for explanations and for reproducing this evaluation. +This is the end-to-end benchmarking result for GROBID version **0.7.0** against the `PMC_sample_1943` dataset, see the [End-to-end evaluation](End-to-end-evaluation.md) page for explanations and for reproducing this evaluation. -The following end-to-end results are using **CRF Wapiti** only as sequence labelling engine. Header extractions are consolidated by default with [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service (the results with CrossRef REST API as consolidation service are similar but much slower). +The following end-to-end results are using: +- **BidLSTM-CRF-FEATURES** as sequence labeling for the citation model +- **CRF Wapiti** as sequence labelling engine for all other models. -More recent versions of these benchmarks with variants and **Deep Learning models** (e.g. newer master snapshots) are available [here](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc). Deep Learning models provide higher accuracy at the cost of slower runtime and more expensive CPU resources. +Header extractions are consolidated by default with [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service (the results with CrossRef REST API as consolidation service should be similar but much slower). + +Other versions of these benchmarks with variants and **Deep Learning models** (e.g. newer master snapshots) are available [here](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc). Note that Deep Learning models might provide higher accuracy, but at the cost of slower runtime and more expensive CPU/GPU resources. Evaluation on 1943 random PDF PMC files out of 1943 PDF from 1943 different journals (0 PDF parsing failure). -Runtime for processing 1943 PDF: **836s** (2,33 PDF per second) on Ubuntu 16.04, 4 CPU i7-4790K (8 threads), 16GB RAM (workstation bought in 2015 for 1600 euros). 
+Runtime for processing 1943 PDF: **797s** (2.44 PDF per second) on Ubuntu 16.04, 4 CPU i7-4790K (8 threads), 16GB RAM (workstation bought in 2015 for 1600 euros) and with a GeForce GTX 1050 Ti GPU. ## Header metadata -Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0). +Evaluation on 1943 random PDF files out of 1943 PDF (ratio 1.0). #### Strict Matching (exact matches) @@ -22,15 +26,14 @@ Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0). | label | precision | recall | f1 | support | |--- |--- |--- |--- |--- | -| abstract | 15.85 | 15.59 | 15.72 | 1911 | -| authors | 92.3 | 92.07 | 92.18 | 1941 | -| first_author | 95.97 | 95.72 | 95.85 | 1941 | -| keywords | 66.5 | 58.12 | 62.03 | 1380 | -| title | 86.92 | 86.21 | 86.56 | 1943 | +| abstract | 16.11 | 15.8 | 15.95 | 1911 | +| authors | 93.07 | 92.68 | 92.88 | 1941 | +| first_author | 96.07 | 95.67 | 95.87 | 1941 | +| keywords | 68.26 | 64.06 | 66.09 | 1380 | +| title | 86.84 | 86.62 | 86.73 | 1943 | | | | | | | -| **all fields (micro avg.)** | **72.26** | **70.43** | **71.33** | 9116 | -| all fields (macro avg.) | 71.51 | 69.54 | 70.47 | 9116 | - +| **all fields (micro avg.)** | **72.71** | **71.58** | **72.14** | 9116 | +| all fields (macro avg.) | 72.07 | 70.97 | 71.5 | 9116 | #### Soft Matching (ignoring punctuation, case and space characters mismatches) @@ -39,15 +42,14 @@ Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0). | label | precision | recall | f1 | support | |--- |--- |--- |--- |--- | -| abstract | 59.2 | 58.24 | 58.72 | 1911 | -| authors | 92.77 | 92.53 | 92.65 | 1941 | -| first_author | 96.07 | 95.83 | 95.95 | 1941 | -| keywords | 75.46 | 65.94 | 70.38 | 1380 | -| title | 94.45 | 93.67 | 94.06 | 1943 | +| abstract | 60.91 | 59.76 | 60.33 | 1911 | +| authors | 93.53 | 93.15 | 93.34 | 1941 | +| first_author | 96.17 | 95.78 | 95.97 | 1941 | +| keywords | 76.53 | 71.81 | 74.09 | 1380 | +| title | 94.74 | 94.49 | 94.61 | 1943 | | | | | | | -| **all fields (micro avg.)** | **84.4** | **82.26** | **83.32** | 9116 | -| all fields (macro avg.) | 83.59 | 81.24 | 82.35 | 9116 | - +| **all fields (micro avg.)** | **85.09** | **83.76** | **84.42** | 9116 | +| all fields (macro avg.) | 84.37 | 83 | 83.67 | 9116 | #### Levenshtein Matching (Minimum Levenshtein distance at 0.8) @@ -56,15 +58,14 @@ Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0). | label | precision | recall | f1 | support | |--- |--- |--- |--- |--- | -| abstract | 86.81 | 85.4 | 86.1 | 1911 | -| authors | 95.92 | 95.67 | 95.8 | 1941 | -| first_author | 96.38 | 96.14 | 96.26 | 1941 | -| keywords | 85.24 | 74.49 | 79.51 | 1380 | -| title | 97.41 | 96.6 | 97 | 1943 | +| abstract | 88.48 | 86.81 | 87.64 | 1911 | +| authors | 96.43 | 96.03 | 96.23 | 1941 | +| first_author | 96.48 | 96.08 | 96.28 | 1941 | +| keywords | 86.02 | 80.72 | 83.29 | 1380 | +| title | 97.83 | 97.58 | 97.71 | 1943 | | | | | | | -| **all fields (micro avg.)** | **92.97** | **90.61** | **91.77** | 9116 | -| all fields (macro avg.) | 92.35 | 89.66 | 90.93 | 9116 | - +| **all fields (micro avg.)** | **93.58** | **92.12** | **92.85** | 9116 | +| all fields (macro avg.) | 93.05 | 91.45 | 92.23 | 9116 | #### Ratcliff/Obershelp Matching (Minimum Ratcliff/Obershelp similarity at 0.95) @@ -73,35 +74,34 @@ Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0). 
| label | precision | recall | f1 | support | |--- |--- |--- |--- |--- | -| abstract | 82.23 | 80.9 | 81.56 | 1911 | -| authors | 94.37 | 94.13 | 94.25 | 1941 | -| first_author | 95.97 | 95.72 | 95.85 | 1941 | -| keywords | 81.01 | 70.8 | 75.56 | 1380 | -| title | 97.04 | 96.24 | 96.64 | 1943 | +| abstract | 84.64 | 83.05 | 83.84 | 1911 | +| authors | 94.98 | 94.59 | 94.79 | 1941 | +| first_author | 96.07 | 95.67 | 95.87 | 1941 | +| keywords | 81.93 | 76.88 | 79.33 | 1380 | +| title | 97.37 | 97.12 | 97.24 | 1943 | | | | | | | -| **all fields (micro avg.)** | **90.92** | **88.61** | **89.75** | 9116 | -| all fields (macro avg.) | 90.13 | 87.56 | 88.77 | 9116 | +| **all fields (micro avg.)** | **91.69** | **90.26** | **90.97** | 9116 | +| all fields (macro avg.) | 91 | 89.46 | 90.21 | 9116 | #### Instance-level results ``` Total expected instances: 1943 -Total correct instances: 202 (strict) -Total correct instances: 797 (soft) -Total correct instances: 1259 (Levenshtein) -Total correct instances: 1153 (ObservedRatcliffObershelp) - -Instance-level recall: 10.4 (strict) -Instance-level recall: 41.02 (soft) -Instance-level recall: 64.8 (Levenshtein) -Instance-level recall: 59.34 (RatcliffObershelp) +Total correct instances: 218 (strict) +Total correct instances: 869 (soft) +Total correct instances: 1365 (Levenshtein) +Total correct instances: 1256 (ObservedRatcliffObershelp) + +Instance-level recall: 11.22 (strict) +Instance-level recall: 44.72 (soft) +Instance-level recall: 70.25 (Levenshtein) +Instance-level recall: 64.64 (RatcliffObershelp) ``` - ## Citation metadata -Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0). +Evaluation on 1943 random PDF files out of 1943 PDF (ratio 1.0). #### Strict Matching (exact matches) @@ -109,18 +109,17 @@ Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0). | label | precision | recall | f1 | support | |--- |--- |--- |--- |--- | -| authors | 84.49 | 74.79 | 79.35 | 85778 | -| date | 93.28 | 81.48 | 86.98 | 87067 | -| first_author | 90.99 | 80.53 | 85.44 | 85778 | -| inTitle | 72.07 | 69.53 | 70.78 | 81007 | -| issue | 89.27 | 82.97 | 86 | 16635 | -| page | 94.85 | 83.56 | 88.85 | 80501 | -| title | 78.97 | 72.17 | 75.42 | 80736 | -| volume | 95.44 | 87.12 | 91.09 | 80067 | +| authors | 82.47 | 75.38 | 78.77 | 85778 | +| date | 94.46 | 82.98 | 88.35 | 87067 | +| first_author | 89.11 | 81.43 | 85.09 | 85778 | +| inTitle | 72.17 | 70.95 | 71.56 | 81007 | +| issue | 89.04 | 83.14 | 85.99 | 16635 | +| page | 95.94 | 85.15 | 90.22 | 80501 | +| title | 79 | 74.48 | 76.67 | 80736 | +| volume | 95.92 | 89.01 | 92.34 | 80067 | | | | | | | -| **all fields (micro avg.)** | **87.07** | **78.58** | **82.61** | 597569 | -| all fields (macro avg.) | 87.42 | 79.02 | 82.99 | 597569 | - +| **all fields (micro avg.)** | **86.86** | **79.99** | **83.29** | 597569 | +| all fields (macro avg.) | 87.26 | 80.31 | 83.62 | 597569 | #### Soft Matching (ignoring punctuation, case and space characters mismatches) @@ -129,18 +128,17 @@ Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0). 
| label | precision | recall | f1 | support | |--- |--- |--- |--- |--- | -| authors | 85.18 | 75.41 | 80 | 85778 | -| date | 93.28 | 81.48 | 86.98 | 87067 | -| first_author | 91.28 | 80.78 | 85.71 | 85778 | -| inTitle | 83.61 | 80.66 | 82.11 | 81007 | -| issue | 89.27 | 82.97 | 86 | 16635 | -| page | 94.85 | 83.56 | 88.85 | 80501 | -| title | 90.08 | 82.34 | 86.03 | 80736 | -| volume | 95.44 | 87.12 | 91.09 | 80067 | +| authors | 83.01 | 75.87 | 79.28 | 85778 | +| date | 94.46 | 82.98 | 88.35 | 87067 | +| first_author | 89.3 | 81.6 | 85.28 | 85778 | +| inTitle | 83.54 | 82.13 | 82.83 | 81007 | +| issue | 89.04 | 83.14 | 85.99 | 16635 | +| page | 95.94 | 85.15 | 90.22 | 80501 | +| title | 90.45 | 85.28 | 87.79 | 80736 | +| volume | 95.92 | 89.01 | 92.34 | 80067 | | | | | | | -| **all fields (micro avg.)** | **90.4** | **81.59** | **85.77** | 597569 | -| all fields (macro avg.) | 90.38 | 81.79 | 85.85 | 597569 | - +| **all fields (micro avg.)** | **90.2** | **83.06** | **86.48** | 597569 | +| all fields (macro avg.) | 90.21 | 83.15 | 86.51 | 597569 | #### Levenshtein Matching (Minimum Levenshtein distance at 0.8) @@ -149,18 +147,17 @@ Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0). | label | precision | recall | f1 | support | |--- |--- |--- |--- |--- | -| authors | 89.89 | 79.57 | 84.42 | 85778 | -| date | 93.28 | 81.48 | 86.98 | 87067 | -| first_author | 91.35 | 80.85 | 85.78 | 85778 | -| inTitle | 84.63 | 81.65 | 83.11 | 81007 | -| issue | 89.27 | 82.97 | 86 | 16635 | -| page | 94.85 | 83.56 | 88.85 | 80501 | -| title | 93.24 | 85.22 | 89.05 | 80736 | -| volume | 95.44 | 87.12 | 91.09 | 80067 | +| authors | 88.31 | 80.72 | 84.35 | 85778 | +| date | 94.46 | 82.98 | 88.35 | 87067 | +| first_author | 89.5 | 81.79 | 85.47 | 85778 | +| inTitle | 84.84 | 83.4 | 84.11 | 81007 | +| issue | 89.04 | 83.14 | 85.99 | 16635 | +| page | 95.94 | 85.15 | 90.22 | 80501 | +| title | 92.83 | 87.52 | 90.1 | 80736 | +| volume | 95.92 | 89.01 | 92.34 | 80067 | | | | | | | -| **all fields (micro avg.)** | **91.66** | **82.72** | **86.96** | 597569 | -| all fields (macro avg.) | 91.5 | 82.8 | 86.91 | 597569 | - +| **all fields (micro avg.)** | **91.5** | **84.26** | **87.73** | 597569 | +| all fields (macro avg.) | 91.36 | 84.21 | 87.62 | 597569 | #### Ratcliff/Obershelp Matching (Minimum Ratcliff/Obershelp similarity at 0.95) @@ -169,80 +166,78 @@ Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0). | label | precision | recall | f1 | support | |--- |--- |--- |--- |--- | -| authors | 87.3 | 77.28 | 81.99 | 85778 | -| date | 93.28 | 81.48 | 86.98 | 87067 | -| first_author | 91.01 | 80.54 | 85.46 | 85778 | -| inTitle | 82.2 | 79.3 | 80.73 | 81007 | -| issue | 89.27 | 82.97 | 86 | 16635 | -| page | 94.85 | 83.56 | 88.85 | 80501 | -| title | 92.26 | 84.33 | 88.12 | 80736 | -| volume | 95.44 | 87.12 | 91.09 | 80067 | +| authors | 85.37 | 78.03 | 81.54 | 85778 | +| date | 94.46 | 82.98 | 88.35 | 87067 | +| first_author | 89.13 | 81.44 | 85.11 | 85778 | +| inTitle | 82.17 | 80.78 | 81.47 | 81007 | +| issue | 89.04 | 83.14 | 85.99 | 16635 | +| page | 95.94 | 85.15 | 90.22 | 80501 | +| title | 92.38 | 87.1 | 89.66 | 80736 | +| volume | 95.92 | 89.01 | 92.34 | 80067 | | | | | | | -| **all fields (micro avg.)** | **90.76** | **81.91** | **86.11** | 597569 | -| all fields (macro avg.) | 90.7 | 82.07 | 86.15 | 597569 | +| **all fields (micro avg.)** | **90.58** | **83.41** | **86.85** | 597569 | +| all fields (macro avg.) 
| 90.55 | 83.45 | 86.83 | 597569 | #### Instance-level results ``` Total expected instances: 90125 -Total extracted instances: 88824 -Total correct instances: 38412 (strict) -Total correct instances: 49926 (soft) -Total correct instances: 54186 (Levenshtein) -Total correct instances: 51130 (RatcliffObershelp) +Total extracted instances: 87994 +Total correct instances: 39070 (strict) +Total correct instances: 50916 (soft) +Total correct instances: 55618 (Levenshtein) +Total correct instances: 52284 (RatcliffObershelp) -Instance-level precision: 43.25 (strict) -Instance-level precision: 56.21 (soft) -Instance-level precision: 61 (Levenshtein) -Instance-level precision: 57.56 (RatcliffObershelp) +Instance-level precision: 44.4 (strict) +Instance-level precision: 57.86 (soft) +Instance-level precision: 63.21 (Levenshtein) +Instance-level precision: 59.42 (RatcliffObershelp) -Instance-level recall: 42.62 (strict) -Instance-level recall: 55.4 (soft) -Instance-level recall: 60.12 (Levenshtein) -Instance-level recall: 56.73 (RatcliffObershelp) +Instance-level recall: 43.35 (strict) +Instance-level recall: 56.49 (soft) +Instance-level recall: 61.71 (Levenshtein) +Instance-level recall: 58.01 (RatcliffObershelp) -Instance-level f-score: 42.93 (strict) -Instance-level f-score: 55.8 (soft) -Instance-level f-score: 60.56 (Levenshtein) -Instance-level f-score: 57.14 (RatcliffObershelp) +Instance-level f-score: 43.87 (strict) +Instance-level f-score: 57.17 (soft) +Instance-level f-score: 62.45 (Levenshtein) +Instance-level f-score: 58.71 (RatcliffObershelp) -Matching 1 : 64923 +Matching 1 : 67183 -Matching 2 : 4694 +Matching 2 : 4042 -Matching 3 : 2744 +Matching 3 : 2332 -Matching 4 : 681 +Matching 4 : 739 -Total matches : 73042 +Total matches : 74296 ``` - #### Citation context resolution ``` Total expected references: 90125 - 46.38 references per article -Total predicted references: 88824 - 45.71 references per article +Total predicted references: 87994 - 45.29 references per article Total expected citation contexts: 139835 - 71.97 citation contexts per article -Total predicted citation contexts: 120560 - 62.05 citation contexts per article +Total predicted citation contexts: 121136 - 62.34 citation contexts per article -Total correct predicted citation contexts: 98016 - 50.45 citation contexts per article -Total wrong predicted citation contexts: 22544 (wrong callout matching, callout missing in NLM, or matching with a bib. ref. not aligned with a bib.ref. in NLM) +Total correct predicted citation contexts: 100034 - 51.48 citation contexts per article +Total wrong predicted citation contexts: 21102 (wrong callout matching, callout missing in NLM, or matching with a bib. ref. not aligned with a bib.ref. in NLM) -Precision citation contexts: 81.3 -Recall citation contexts: 70.09 -fscore citation contexts: 75.28 +Precision citation contexts: 82.58 +Recall citation contexts: 71.54 +fscore citation contexts: 76.66 ``` - ## Fulltext structures Fulltext structure contents are complicated to capture from JATS NLM files. They are often normalized, differ from the actual PDF content, and can be inconsistent from one document to another. The scores of the following metrics are thus not very meaningful in absolute terms, in particular for the strict matching (the textual content of the structure can be very long). As relative values for comparing different models, they nevertheless seem useful. -Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0).
+Evaluation on 1943 random PDF files out of 1943 PDF (ratio 1.0). #### Strict Matching (exact matches) **Field-level results** | label | precision | recall | f1 | support | |--- |--- |--- |--- |--- | -| figure_title | 27.92 | 21.75 | 24.45 | 7058 | -| reference_citation | 57.31 | 58.97 | 58.13 | 134196 | -| reference_figure | 63.44 | 63.91 | 63.67 | 19330 | -| reference_table | 82.74 | 84.21 | 83.47 | 7327 | -| section_title | 75.63 | 67.1 | 71.11 | 27619 | -| table_title | 57.69 | 54.84 | 56.23 | 3784 | +| figure_title | 30.89 | 25.49 | 27.93 | 7058 | +| reference_citation | 57.33 | 59.18 | 58.24 | 134196 | +| reference_figure | 64.42 | 63.15 | 63.78 | 19330 | +| reference_table | 82.75 | 83.81 | 83.28 | 7327 | +| section_title | 77.06 | 67.58 | 72.01 | 27619 | +| table_title | 57.17 | 53.12 | 55.07 | 3784 | | | | | | | -| **all fields (micro avg.)** | **60.32** | **60.11** | **60.22** | 199314 | -| all fields (macro avg.) | 60.79 | 58.46 | 59.51 | 199314 | - +| **all fields (micro avg.)** | **60.59** | **60.32** | **60.46** | 199314 | +| all fields (macro avg.) | 61.6 | 58.72 | 60.05 | 199314 | #### Soft Matching (ignoring punctuation, case and space characters mismatches) **Field-level results** | label | precision | recall | f1 | support | |--- |--- |--- |--- |--- | -| figure_title | 73.19 | 57 | 64.09 | 7058 | -| reference_citation | 61.47 | 63.25 | 62.34 | 134196 | -| reference_figure | 63.97 | 64.44 | 64.21 | 19330 | -| reference_table | 82.9 | 84.37 | 83.63 | 7327 | -| section_title | 80.56 | 71.47 | 75.75 | 27619 | -| table_title | 80.51 | 76.53 | 78.47 | 3784 | +| figure_title | 79.17 | 65.33 | 71.59 | 7058 | +| reference_citation | 61.41 | 63.39 | 62.38 | 134196 | +| reference_figure | 65 | 63.71 | 64.35 | 19330 | +| reference_table | 82.9 | 83.96 | 83.43 | 7327 | +| section_title | 81.97 | 71.88 | 76.59 | 27619 | +| table_title | 81.85 | 76.06 | 78.85 | 3784 | | | | | | | -| **all fields (micro avg.)** | **65.54** | **65.31** | **65.43** | 199314 | -| all fields (macro avg.) | 73.77 | 69.51 | 71.41 | 199314 | - -Evaluation metrics produced in 924.038 seconds +| **all fields (micro avg.)** | **65.95** | **65.66** | **65.8** | 199314 | +| all fields (macro avg.) | 75.38 | 70.72 | 72.86 | 199314 | +Evaluation metrics produced in 916.228 seconds diff --git a/doc/Grobid-batch.md b/doc/Grobid-batch.md index e1a66973e6..229e1b0bbe 100644 --- a/doc/Grobid-batch.md +++ b/doc/Grobid-batch.md @@ -18,7 +18,7 @@ The following command displays some help for the batch commands: Be sure to replace `<version>` with the current version of GROBID that you have installed and built. For example: ```bash -> java -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -h +> java -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -h ``` The available batch commands are listed below. For those commands, at least `-Xmx1G` is used to set the JVM memory to avoid *OutOfMemoryException* given the current size of the Grobid models and the craziness of some PDF. For complete fulltext processing, which involves all the GROBID models, `-Xmx4G` is recommended (although allocating less memory is usually fine).
@@ -40,7 +40,7 @@ The needed parameters for that command are: Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader ``` WARNING: the expected extension of the PDF files to be processed is .pdf @@ -64,7 +64,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf Example: ```bash -> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText +> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText ``` WARNING: the expected extension of the PDF files to be processed is .pdf @@ -78,7 +78,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format" +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format" ``` ### processAuthorsHeader @@ -90,7 +90,7 @@ Example: Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors" +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors" ``` ### processAuthorsCitation @@ -102,7 +102,7 @@ Example: Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors" +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors" ``` ### processAffiliation @@ -114,7 +114,7 @@ Example: Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation" +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation" ``` ### processRawReference @@ -126,7 +126,7 @@ Example: Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -exe processRawReference -s "a reference string" +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -exe processRawReference -s "a reference string" ``` ### processReferences @@ -142,7 +142,7 @@ Example: Example: ```bash -> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences +> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences ``` WARNING: the expected extension of the PDF files to be processed is .pdf @@ -158,7 +158,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe 
processCitationPatentST36 ``` WARNING: extension of the ST.36 files to be processed must be .xml @@ -174,7 +174,7 @@ WARNING: extension of the ST.36 files to be processed must be .xml Example: ``` -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT ``` WARNING: extension of the text files to be processed must be .txt, and expected encoding is UTF-8 @@ -190,7 +190,7 @@ WARNING: extension of the text files to be processed must be .txt, and expected Example: ``` -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF ``` WARNING: extension of the PDF files to be processed must be .pdf @@ -206,7 +206,7 @@ WARNING: extension of the PDF files to be processed must be .pdf Example: ```bash -> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining +> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining ``` WARNING: the expected extension of the PDF files to be processed is .pdf @@ -222,7 +222,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf Example: ```bash -> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank +> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank ``` WARNING: the expected extension of the PDF files to be processed is .pdf @@ -240,7 +240,7 @@ The needed parameters for that command are: Example: ``` -> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.6.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation +> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.7.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation ``` WARNING: extension of the PDF files to be processed must be .pdf diff --git a/doc/Grobid-docker.md b/doc/Grobid-docker.md index f42bd069ac..14b3d606e0 100644 --- a/doc/Grobid-docker.md +++ b/doc/Grobid-docker.md @@ -51,7 +51,7 @@ The process for retrieving and running the image is as follows: Current latest version: ```bash -> docker pull grobid/grobid:0.6.2 +> docker pull grobid/grobid:0.7.0 ``` - Run the container: @@ -87,7 +87,7 @@ Grobid web services are then available as described in the [service documentatio The simplest way to pass a modified configuration to the docker image is to mount the yaml GROBID config file `grobid.yaml` when running the image.
Modify the config file `grobid/grobid-home/config/grobid.yaml` according to your requirements on the host machine and mount it when running the image as follows: ```bash docker run --rm --gpus all --init -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.7.1-SNAPSHOT ``` You need to use an absolute path to specify your modified `grobid.yaml` file. @@ -172,19 +172,19 @@ However if you are interested in using the master version of Grobid in container For building a CRF-only image, the dockerfile to be used is `./Dockerfile.crf`. The only important information then is the version which will be checked out from the tags. ```bash -> docker build -t grobid/grobid:0.6.2 --build-arg GROBID_VERSION=0.6.2 --file Dockerfile.crf . +> docker build -t grobid/grobid:0.7.0 --build-arg GROBID_VERSION=0.7.0 --file Dockerfile.crf . ``` Similarly, if you want to create a docker image from the current master, development version: ```bash -> docker build -t grobid/grobid:0.7.0-SNAPSHOT --build-arg GROBID_VERSION=0.7.0-SNAPSHOT --file Dockerfile.crf . +> docker build -t grobid/grobid:0.7.1-SNAPSHOT --build-arg GROBID_VERSION=0.7.1-SNAPSHOT --file Dockerfile.crf . ``` -In order to run the container of the newly created image, for example for version `0.6.2`: +In order to run the container of the newly created image, for example for version `0.7.0`: ```bash -> docker run -t --rm --init -p 8080:8070 -p 8081:8071 grobid/grobid:0.6.2 +> docker run -t --rm --init -p 8080:8070 -p 8081:8071 grobid/grobid:0.7.0 ``` For testing or debugging purposes, you can connect to the container with a bash shell (logs are under `/opt/grobid/logs/`): @@ -209,28 +209,28 @@ In order to build an image supporting GPU, you need: Without these two requirements, the image will always default to CPU, even if GPUs are available on the host machine running the image. -For building a CRF-only image, the dockerfile to be used is `./Dockerfile.delft`. The only important information then is the version which will be checked out from the tags. +For building a CRF-only image, the dockerfile to be used is `./Dockerfile.crf` (see previous section). To be able to use both CRF and Deep Learning models, use the dockerfile `./Dockerfile.delft`. The only important information then is the version which will be checked out from the tags. ```bash -> docker build -t grobid/grobid:0.6.2 --build-arg GROBID_VERSION=0.6.2 --file Dockerfile.delft . +> docker build -t grobid/grobid:0.7.0 --build-arg GROBID_VERSION=0.7.0 --file Dockerfile.delft . ``` Similarly, if you want to create a docker image from the current master, development version: ```bash -docker build -t grobid/grobid:0.7.0-SNAPSHOT --build-arg GROBID_VERSION=0.7.0-SNAPSHOT --file Dockerfile.delft . +docker build -t grobid/grobid:0.7.1-SNAPSHOT --build-arg GROBID_VERSION=0.7.1-SNAPSHOT --file Dockerfile.delft .
``` -In order to run the container of the newly created image, for example for the development version `0.7.0-SNAPSHOT`, using all GPU available: +In order to run the container of the newly created image, for example for the development version `0.7.1-SNAPSHOT`, using all available GPUs: ```bash -> docker run --rm --gpus all --init -p 8070:8080 -p 8071:8081 grobid/grobid:0.7.0-SNAPSHOT +> docker run --rm --gpus all --init -p 8080:8070 -p 8081:8071 grobid/grobid:0.7.1-SNAPSHOT ``` In practice, you need to indicate which models should use a Deep Learning model implementation and which ones can remain with a faster CRF model implementation, which is currently done in the `grobid.yaml` file. Modify the config file `grobid/grobid-home/config/grobid.yaml` accordingly on the host machine and mount it when running the image as follows: ```bash docker run --rm --gpus all --init -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.7.1-SNAPSHOT ``` You need to use an absolute path to specify your modified `grobid.yaml` file. diff --git a/doc/Grobid-java-library.md b/doc/Grobid-java-library.md index 606ca6cc2c..b1c42256ff 100644 --- a/doc/Grobid-java-library.md +++ b/doc/Grobid-java-library.md @@ -31,20 +31,20 @@ Here is an example of the grobid-core dependency: <dependency> <groupId>org.grobid</groupId> <artifactId>grobid-core</artifactId> - <version>0.6.2</version> + <version>0.7.0</version> </dependency> ``` If you want to work on a SNAPSHOT development version, you need to include in your pom file the path to the Grobid jar file, -for instance as follows (if necessary replace `0.6.2` by the valid `<version>`): +for instance as follows (if necessary replace `0.7.0` by the valid `<version>`): ```xml <dependency> <groupId>org.grobid</groupId> <artifactId>grobid-core</artifactId> - <version>0.6.2</version> + <version>0.7.0</version> <scope>system</scope> - <systemPath>${project.basedir}/lib/grobid-core-0.6.2.jar</systemPath> + <systemPath>${project.basedir}/lib/grobid-core-0.7.0.jar</systemPath> </dependency> ``` @@ -62,8 +62,8 @@ Add the following snippet in your build.gradle file: and add the Grobid dependency as well: ``` - compile 'org.grobid:grobid-core:0.6.2' - compile 'org.grobid:grobid-trainer:0.6.2' + compile 'org.grobid:grobid-core:0.7.0' + compile 'org.grobid:grobid-trainer:0.7.0' ``` diff --git a/doc/Grobid-service.md b/doc/Grobid-service.md index 7565cbd619..3b1ab0643f 100644 --- a/doc/Grobid-service.md +++ b/doc/Grobid-service.md @@ -23,9 +23,9 @@ You could also build and install the service as a standalone service (let's supp cd .. mkdir grobid-installation cd grobid-installation -unzip ../grobid/grobid-service/build/distributions/grobid-service-0.6.2.zip -mv grobid-service-0.6.2 grobid-service -unzip ../grobid/grobid-home/build/distributions/grobid-home-0.6.2.zip +unzip ../grobid/grobid-service/build/distributions/grobid-service-0.7.0.zip +mv grobid-service-0.7.0 grobid-service +unzip ../grobid/grobid-home/build/distributions/grobid-home-0.7.0.zip ./grobid-service/bin/grobid-service ``` diff --git a/doc/Install-Grobid.md b/doc/Install-Grobid.md index bfa7f6d862..dfd5e191d5 100644 --- a/doc/Install-Grobid.md +++ b/doc/Install-Grobid.md @@ -6,17 +6,17 @@ GROBID requires a JVM installed on your machine, supported version is **JVM 8**.
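A quick way to check which JVM is available on the host before installing (the exact output format depends on the JVM vendor):

```bash
# verify that a JVM (version 8 for GROBID) is installed and on the PATH
java -version
```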
###Latest stable release

-The [latest stable release](https://github.com/kermitt2/grobid#latest-version) of GROBID is version ```0.6.2``` which can be downloaded as follow:
+The [latest stable release](https://github.com/kermitt2/grobid#latest-version) of GROBID is version ```0.7.0```, which can be downloaded as follows:

```bash
-> wget https://github.com/kermitt2/grobid/archive/0.6.2.zip
-> unzip 0.6.2.zip
+> wget https://github.com/kermitt2/grobid/archive/0.7.0.zip
+> unzip 0.7.0.zip
```

or using the [docker](Grobid-docker.md) container.

###Current development version

-The current development version is ```0.7.0-SNAPSHOT```, which can be downloaded from GitHub and built as follow:
+The current development version is ```0.7.1-SNAPSHOT```, which can be downloaded from GitHub and built as follows:

Clone the source code from GitHub:

```bash

diff --git a/doc/Introduction.md b/doc/Introduction.md
index 934fa08dad..9de66fda4d 100644
--- a/doc/Introduction.md
+++ b/doc/Introduction.md
@@ -6,7 +6,12 @@

[![Build Status](https://travis-ci.org/kermitt2/grobid.svg?branch=master)](https://travis-ci.org/kermitt2/grobid)
[![Coverage Status](https://coveralls.io/repos/kermitt2/grobid/badge.svg)](https://coveralls.io/r/kermitt2/grobid)
[![Documentation Status](https://readthedocs.org/projects/grobid/badge/?version=latest)](https://readthedocs.org/projects/grobid/?badge=latest)
-[![Docker Status](https://images.microbadger.com/badges/version/lfoppiano/grobid.svg)](https://hub.docker.com/r/lfoppiano/grobid/ "Latest Docker HUB image")
+[![GitHub release](https://img.shields.io/github/release/kermitt2/grobid.svg)](https://github.com/kermitt2/grobid/releases/)
+[![Release](https://jitpack.io/v/kermitt2/grobid.svg)](https://jitpack.io/#kermitt2/grobid)
+[![Demo cloud.science-miner.com/grobid](https://img.shields.io/website-up-down-green-red/https/cloud.science-miner.com/grobid.svg)](http://cloud.science-miner.com/grobid)
+[![Docker Hub](https://img.shields.io/docker/pulls/lfoppiano/grobid.svg)](https://hub.docker.com/r/lfoppiano/grobid/ "Docker Pulls")
+[![Docker Hub](https://img.shields.io/docker/pulls/grobid/grobid.svg)](https://hub.docker.com/r/grobid/grobid/ "Docker Pulls")
+[![SWH](https://archive.softwareheritage.org/badge/origin/https://github.com/kermitt2/grobid/)](https://archive.softwareheritage.org/browse/origin/?origin_url=https://github.com/kermitt2/grobid)

## Purpose

diff --git a/doc/Principle.md b/doc/Principle.md
deleted file mode 100644
index 8f112444af..0000000000
--- a/doc/Principle.md
+++ /dev/null
@@ -1,4 +0,0 @@
-

Principles

-
-
-

diff --git a/doc/Principles.md b/doc/Principles.md
new file mode 100644
index 0000000000..8a1e920dc6
--- /dev/null
+++ b/doc/Principles.md
@@ -0,0 +1,108 @@
+

How GROBID works

+
+GROBID is a machine learning library for extracting, parsing and re-structuring raw documents, in particular PDF, into structured XML/TEI encoded documents, with a particular focus on technical and scientific publications. The goal of GROBID is to facilitate text mining, information extraction and semantic analysis of scientific publications by transforming them into machine-friendly, structured, and predictable representations.
+
+In large-scale scientific document ingestion tasks, the vast majority of documents are only available in PDF (in particular the decades of back files published before the year 2000). Scholarly articles are today more frequently available as XML, but obtaining them often requires particular agreements and long negotiations with publishers. PDF thus remains the most important format usable under fair use, or under the recent copyright exception for text mining in the EU. When publisher XML is available, it remains challenging to process, because it comes in a variety of native publisher XML formats, often incomplete and inconsistent from one to another, and difficult to use at scale.
+
+![Ingesting scientific documents with GROBID](img/ingestion.png)
+

+Fig. 1 - Ingesting scientific documents with GROBID +

+
+To process publisher XML, complementary to GROBID, we built [Pub2TEI](https://github.com/kermitt2/Pub2TEI), a collection of style sheets developed over 11 years, able to transform a variety of publisher XML formats into the same TEI XML format as produced by GROBID. This common format, which supersedes a dozen publisher formats and many of their flavors, makes it possible to centralize any further processing across PDF and heterogeneous XML sources, and to support various applications (see __Fig. 1__).
+
+The rest of this page gives an overview of the main GROBID design principles. Skip it if you are not interested in the technical details. Functionalities are described in the [User Manual](https://grobid.readthedocs.io/en/latest/).
+
+##Document parsing as a cascade of sequence labeling models
+
+GROBID uses a cascade of sequence labeling models to parse a document. This modular approach makes it possible to adapt the training data, the features, the text representations and the models to the different hierarchical structures of the document. Each individual model maintains a small number of labels (which is easier to manage and train), but, in combination, the full cascade provides very detailed end-result structures. The final models produce 55 different "leaf" labels, while other document layout analysis systems support significantly fewer label categories (up to 22 for the GROTOAP2 dataset and CERMINE, _Tkaczyk et al., 2014_, the maximum to our knowledge after GROBID).
+
+In GROBID, sequence labeling is defined in an abstract manner, and its concrete implementation can be selected among different standard ML architectures, including a fast linear-chain CRF and a variety of state-of-the-art Deep Learning (DL) models. Sequence labeling models are limited to labeling a linear sequence of tokens: they associate a one-dimensional structure to a stream of tokens. One way to create additional levels of nested structures is to cascade several sequence labeling models, the output of a first model being piped into one or several downstream models. This is the approach taken by GROBID (a minimal sketch of this piping is given after __Fig. 2__ below).
+
+__Fig. 2__ shows the current model cascade. Each model typically uses its own combination of sequence labeling algorithm, features, and possibly a different tokenizer. The model architecture and parameters depend on the labels to be used, on the amount of available training data, on the runtime, memory and accuracy constraints, etc. This approach also helps to mitigate class imbalance problems: by keeping imbalanced classes in separate models, a majority class like "paragraph" will not impact a rare class from a non-body area (e.g. a field appearing only once in a header).
+
+![The GROBID cascade of sequence labeling models](img/cascade.png)
+

+Fig. 2 - The GROBID cascade of sequence labeling models +
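+
+To make the cascade concrete, here is a minimal sketch of the piping idea, under the assumption of a generic labeler interface; all names (`SequenceLabeler`, `Cascade`, the labels) are hypothetical and do not belong to the actual GROBID API. For simplicity the sketch labels tokens at both levels, whereas the real segmentation model works on lines.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+// Illustrative sketch only: these are hypothetical names, not GROBID classes.
+interface SequenceLabeler {
+    // returns one label per input token
+    List<String> label(List<String> tokens);
+}
+
+class Cascade {
+    SequenceLabeler segmentation; // upstream model: zones like <header>, <body>, ...
+    SequenceLabeler header;       // downstream model: fields like <title>, <author>, ...
+
+    Cascade(SequenceLabeler segmentation, SequenceLabeler header) {
+        this.segmentation = segmentation;
+        this.header = header;
+    }
+
+    // The upstream model labels the whole document; the tokens of the zones
+    // it selects are then re-labeled by the downstream model.
+    List<String> parseHeader(List<String> documentTokens) {
+        List<String> zones = segmentation.label(documentTokens);
+        List<String> headerTokens = new ArrayList<>();
+        for (int i = 0; i < documentTokens.size(); i++) {
+            if ("<header>".equals(zones.get(i))) {
+                headerTokens.add(documentTokens.get(i));
+            }
+        }
+        return header.label(headerTokens);
+    }
+}
+```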

+
+The _segmentation model_, for instance, is used to detect the main areas of a document, e.g. the title page, the header, the body, the head and foot notes, the bibliographical sections, etc. This particular model works by labeling each line and relies heavily on layout features. Working at line level is significantly faster than a token-level model, which matters for a model applied to the entire content of the document. The areas introduced by this model correspond to large zones, which never interrupt a line.
+
+The header areas detected by the segmentation model (which can be several non-continuous areas distributed over several pages) are passed to the _header model_. The header model is trained to recognize information like title, authors, affiliation, abstract, etc. This model works at layout token level. As it processes a smaller amount of text, working at token level is less impactful on runtime, and the model can use a larger amount of training examples.
+
+Some models can be used at several locations in the document. For instance the _date model_, used to segment a raw date into year, month, etc. and to provide a normalized ISO date, is called when dates are identified in the header area, but also when parsing a reference zone. Similarly, the figure and table models are used to sub-structure all the figures and tables of a document. A GROBID model is thus context-free.
+
+The structuring of the same entity type can however depend on the position of this entity. For instance, author names in the header and author names in a reference string are expressed in a different manner: author names in the header usually use full names and are associated with affiliation markers, while author names in reference strings are usually much shorter and never mixed with affiliation information. For this reason, we introduced two different models for name parsing, a `name_header` model and a `name_citation` model.
+
+Cascading models thus offers the flexibility to tune each model, and its associated simpler training data, to the nature of the structure to be recognized. In addition, it keeps each model small, while producing, in combination, very fine-grained final structures. Finally, although errors from one model can propagate to another model, we train each model with a certain amount of realistic errors and noise as input (which is more or less always happening with PDF anyway), which makes it possible to recover from upstream model errors.
+
+##Layout tokens, not text
+
+The different GROBID models do not work on text, but on **Layout Tokens**, to exploit the various visual/layout information available for every token. Layout information provides at the same time more decision criteria for the recognition of structures and more robustness to layout variations.
+
+A GROBID Layout Token is a structure containing the Unicode text token, but also the associated rich text information available (font size and name, style attributes - bold, italic, superscript/subscript) and the location in the PDF expressed by bounding boxes (a simplified sketch of such a token is given after __Fig. 3__ below). Layout Tokens are grouped following visual criteria (lines, blocks, columns) as a first result of the PDF layout analysis, and then further semantically grouped through the Machine Learning process, following the labeled fields. In addition, this layout information is used to create additional layout features like indentation, relative spacing indicators, relative page vertical and horizontal positions, character density, or bitmap/vector graphics relative position information.
In most GROBID models, these layout features are set for every layout token.
+
+The layout information is extracted and built by [pdfalto](https://github.com/kermitt2/pdfalto), a PDF parser that provides line, block and various position and style information to GROBID. Complementary to the support of ALTO, a modern format for OCR output, pdfalto handles a variety of cleaning processes: UTF-8 encoding and character composition, the recognition of superscript/subscript styles, the robust recognition of line numbers in review manuscripts, the recovery of text order at block level, the detection of columns, etc. The detection of token boundaries, lines and blocks relies on XY projection and heuristics. pdfalto also extracts embedded bitmaps (all converted into PNG) and vector graphics (in SVG), PDF metadata (XMP) and PDF annotations for further usage in GROBID.
+
+Layout information is used to instantiate layout features, which can be exploited or not depending on the capacity of the ML model implementation. Layout features are useful for the reliable recognition of structures such as titles, abstracts, section titles, figures, tables and reference markers, which are often mostly characterized by their relative position (vertical space, indentation, blocks, etc.) and font style (e.g. superscript for reference markers, or a title in a larger font).
+
+Dedicated joint Deep Learning models able to exploit these additional layout features have been developed in [DeLFT](https://github.com/kermitt2/delft) to complement the CRF models.
+
+![PDF annotation service with Figure pop-up](img/Screenshot4.png)
+

+Fig. 3 - Visualization of a cited figure in context +
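+
+As announced above, here is a simplified sketch of what a layout token carries; the field selection is indicative only (the actual GROBID `LayoutToken` class holds more information, and the names below are hypothetical):
+
+```java
+// Simplified, illustrative sketch of a layout token: the Unicode text plus
+// the style and position information attached to it. The real GROBID class
+// carries more fields than shown here.
+class SimpleLayoutToken {
+    String text;           // Unicode token text
+    String fontName;       // font information from the PDF
+    double fontSize;
+    boolean bold;
+    boolean italic;
+    boolean superscript;   // or subscript, as detected by pdfalto
+    double x, y;           // top-left corner of the bounding box
+    double width, height;  // extent of the bounding box
+    int page;              // page number in the PDF
+}
+```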

+
+
+GROBID models maintain a synchronization between the labeling process and the layout token bounding boxes: as the labeled fields are built via sequence labeling, the bounding boxes of the created structures are also built. Operations on 2D bounding boxes are well known and straightforward to apply to Layout elements (a sketch of the box union is given after __Fig. 4__ below). By synchronizing the bounding boxes with the sequence labeling, we can render any structured result on its original PDF source. More generally, applied to any PDF processing, extracted structures and annotations can include bounding boxes giving a precise location in the original document layout. Text mining is then not limited to populating a database: it allows user-friendly visualizations of semantically enriched documents and new interactions. __Fig. 3 and 4__ present two examples of visualization of extracted objects, thanks to the GROBID coordinates associated with the structures.
+
+![PDF annotation service with Equation pop-up](img/Screenshot5.png)
+

+Fig. 4 - Visualization of a cited equation in context +
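+
+The core operation behind this synchronization can be sketched as a simple box union: the bounding box of a labeled field is the union of the boxes of its tokens. The code below is illustrative only, reusing the hypothetical `SimpleLayoutToken` sketch given earlier and assuming the single-page case for brevity; it is not the GROBID API.
+
+```java
+import java.util.List;
+
+class BoundingBoxes {
+    // Union of the token boxes of a labeled field, assuming all tokens lie
+    // on the same page (illustrative sketch, not actual GROBID code).
+    static double[] fieldBoundingBox(List<SimpleLayoutToken> tokens) {
+        double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
+        double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
+        for (SimpleLayoutToken t : tokens) {
+            minX = Math.min(minX, t.x);
+            minY = Math.min(minY, t.y);
+            maxX = Math.max(maxX, t.x + t.width);
+            maxY = Math.max(maxY, t.y + t.height);
+        }
+        // returned as {x, y, width, height}
+        return new double[] { minX, minY, maxX - minX, maxY - minY };
+    }
+}
+```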

+
+##Training data: _Qualität statt Quantität_ (quality over quantity)
+
+GROBID does not use vast amounts of training data derived from existing publisher XML documents, like CERMINE _(Tkaczyk et al., 2015)_ or ScienceParse 1 & 2, but small, high-quality sets of manually labeled training data. The data to be labeled are directly generated from PDF (not from publisher XML) and continuously extended with error cases. Although we also experimented with the large-set approaches and auto-generated training data at scale, we currently stick with the small/high-quality approach, for the following reasons:
+
+- Exploiting publisher XML supposes being able to align the clean XML content with the noisy PDF content. This is difficult to realize in practice at full document scale, because publisher fulltext XML does not follow the actual PDF object stream, and some XML elements are encoded very differently from what can be extracted from the PDF (e.g. equations, chemical formulas, tables, section titles, references, ...), are present only in the XML (sometimes keywords in PMC JATS are not in the PDF), or are present only in the PDF (cover page, copyright/editorial statements, head notes). In addition, some spurious template presentation tokens in the PDF are normally absent from the XML because they are considered presentation sugar or noise - which they are, as they do not carry any useful semantic information. These PDF scoria are however very useful to help the recognition of structures, as they can indicate field boundaries. A super-large dataset built from publisher XML/PDF pairs tends to be closer to the XML than to the actual PDF content, because either (i) only PDFs very close to the corresponding XML are successfully aligned and kept, or (ii) only "easy" document layout segments/pages are kept.
+
+- With a large amount of training data, the addition of a few new examples often has no generalization impact, because the new examples are diluted in the vast amount of existing training data. It is then in practice impossible to further improve the model with additional training data and to recover from errors. On the other hand, with a small training dataset, the addition of a few error cases can correct the model, and it is possible to quickly iterate and improve the model continuously in an active learning manner.
+
+- Using available publisher XML, it is difficult to build a large set of training data presenting a good diversity of domains and layouts. Beyond PMC and preprints, other kinds of publications would be needed, but they are complicated to harvest at a similar scale for copyright reasons and because of the mosaic of publishers. In contrast, by building our small training set iteratively with error cases, we preferably introduce documents from domains and publishers badly represented in the current training dataset, and maintain a stronger diversity.
+
+- High-quality training data usually compensates well for a small training size: quality training data generally improves the learning rate, because inconsistent annotations increase artificial aleatoric uncertainty in the model (note: reference needed).
+
+- A lower amount of training data keeps models smaller (e.g. with CRF), faster to train, and thus easier for setting hyperparameters.
+
+In practice, the size of the GROBID training data is smaller than that of CERMINE _(Tkaczyk et al., 2015)_ by a factor of 30 to 100, and smaller than that of ScienceParse 2 by a factor of 2500 to 10000, but GROBID provides comparable or better accuracy scores.
To help ensure high-quality training data, we develop detailed [annotation guidelines](training/General-principles/) to remove as much as possible the disagreements/inconsistencies regarding annotation decisions. The training data is reviewed regularly. We do not use double-blind annotation with reconciliation and do not compute Inter-Annotator Agreement (as we should), because the average size of the annotation team is under 2 :)
+
+As the training data is crafted for accuracy and coverage, it is strongly biased by undersampling non-edge cases, and our labeled data therefore cannot be used for evaluation. Evaluations are done on separate and stable holdout sets from publishers, which follow more realistic distributions of document variations. Our publisher evaluation sets however present the same lack-of-diversity drawback as discussed above for the training data, but at least we do not train and evaluate with the same domains and sources of publications, as most similar works do.
+
+For the moment, we are also not relying on transformer approaches incorporating layout information, like LayoutLM _(Xu et al., 2020)_, LayoutLMv2 _(Xu et al., 2021)_, SelfDoc or VILA _(Shen et al., 2021)_, which require considerable GPU capacities and long inference runtimes, and do not show at this time convincing accuracy scores compared to the current, cheap GROBID approach (reported accuracies at token level are often lower than GROBID accuracy at field level, while using fewer labels). However, these approaches are very promising. In GROBID, it is possible to run BERT and SciBERT baseline fine-tuned models, ignoring the available layout features. We think the system is thus more or less ready to experiment with fine-tuning such extended transformer models - or rather few-shot learning, given the size of our annotated example set - when/if they can surpass some of the current models (and when we will have saved enough money to buy a V100 GPU).
+
+##Balancing accuracy and scalability
+
+We develop a tool to process the full scholarly literature corpus (several tens of millions of PDFs), but also to allow interactive usage, e.g. processing the header of a PDF article in sub-second time. This is why the default configuration of GROBID is still set to CRF, to maintain the ability to process PDFs quickly, on commodity hardware, with low memory usage, and to ensure good parallelization and scalability capacities.
+
+However, if the priority is accuracy, custom settings make it possible to maximize accuracy with deep learning models. Using some deep learning models will improve results by a few additional F1-score points (nothing extraordinary, to be honest), but at the price of a slower runtime (2 to 5 times slower), the cost of a GPU, and more limited parallelization.
+
+##References
+
+_(Tkaczyk et al., 2014)_ Dominika Tkaczyk, Pawel Szostek, and Lukasz Bolikowski. 2014. GROTOAP2 - the methodology of creating a large ground truth dataset of scientific articles. D-Lib Magazine, 20(11/12).
+
+_(Tkaczyk et al., 2015)_ Dominika Tkaczyk, Paweł Szostek, Mateusz Fedoryszak, Piotr Jan Dendek, and Łukasz Bolikowski. 2015. CERMINE: automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR), 18(4):317-335.
+
+[Science Parse](https://github.com/allenai/science-parse), https://github.com/allenai/science-parse
+
+[Science Parse v2](https://github.com/allenai/spv2), https://github.com/allenai/spv2
+
+_(Shen et al., 2021)_ Zejiang Shen, Kyle Lo, Lucy Lu Wang, Bailey Kuehl, Daniel S.
Weld, Doug Downey. 2021. [Incorporating Visual Layout Structures for Scientific Text Classification](https://arxiv.org/pdf/2106.00676.pdf). arXiv:2106.00676.
+
+_(Xu et al., 2020)_ Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. 2020. [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/pdf/1912.13318.pdf). KDD 2020.
+
+_(Xu et al., 2021)_ Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. 2021. [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/pdf/2012.14740.pdf). arXiv:2012.14740.

diff --git a/doc/References.md b/doc/References.md
index 7edfdfb990..66424ca611 100644
--- a/doc/References.md
+++ b/doc/References.md
@@ -6,9 +6,10 @@ If you want to cite this work, please simply refer to the github project:

```
GROBID (2008-2021)
```

-Please do not include a particular person name to emphasize the project and the tool!
+Please do not include a particular person's name, to emphasize the project and the tool!

-We also ask you not to cite any old research papers, but the current project itself, because, yes, we can cite a software project in the bibliographical references and not just mention it in a foot note ;)
+We also ask you not to cite any old research papers, but the current project itself, because, yes, we can try to cite a software project in the bibliographical references and not just mention it in a footnote ;)
+Well, it might (and likely will) be rejected by reviewers, the editorial style or the editors, but at least you tried!

Here's a BibTeX entry using the [Software Heritage](https://www.softwareheritage.org/) project-level permanent identifier:

@@ -25,6 +26,8 @@ Here's a BibTeX entry using the [Software Heritage](https://www.softwareheritage

## Presentations on Grobid

+The following presentations are reminders that old machine learning stuff is not like good wine. Please use this project repository for up-to-date information.
+
[GROBID in 30 slides](grobid-04-2015.pdf) (2015).

[GROBID in 20 slides](GROBID.pdf) (2012).

P. Lopez. Automatic Extraction and Resolution of Bibliographical References in P

## Evaluation and usages

-The following articles are provided for information - it does not mean that we agree with all their statements about Grobid (please refer to the present documentation for the actual features and capacities of the tool) or with all the various methodologies used for evaluation, but they all explore interesting aspects with Grobid.
+The following articles are provided for information. It does not mean that we agree with all their statements about Grobid (please refer to the present documentation for the actual features and capacities of the tool) or with all the various methodologies used for evaluation, but they all explore interesting aspects related to Grobid.

- M. Lipinski, K. Yao, C. Breitinger, J. Beel, and B. Gipp. [Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents](http://docear.org/papers/Evaluation_of_Header_Metadata_Extraction_Approaches_and_Tools_for_Scientific_PDF_Documents.pdf), in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Indianapolis, IN, USA, 2013.

@@ -52,24 +55,55 @@ The following articles are provided for information - it does not mean that we a

- Mark Grennan and Joeran Beel. [Synthetic vs.
Real Reference Strings for Citation Parsing, and the Importance of Re-training and Out-Of-Sample Data for Meaningful Evaluations: Experiments with GROBID, GIANT and Cora](https://arxiv.org/pdf/2004.10410.pdf). [arXiv:2004.10410](https://arxiv.org/abs/2004.10410), 2020.

+- J.M. Nicholson, M. Mordaunt, P. Lopez, A. Uppala, D. Rosati, N.P. Rodrigues, P. Grabitz, S.C. Rife.
+[scite: a smart citation index that displays the context of citations and classifies their intent using deep learning](https://www.biorxiv.org/content/10.1101/2021.03.15.435418v1); bioRxiv preprint 2021. doi: https://doi.org/10.1101/2021.03.15.435418
+
## Articles on CRF for bibliographical reference parsing

+For archaeological purposes: the first paper below was the main motivation and influence for starting GROBID.
+
- Fuchun Peng and Andrew McCallum. [Accurate Information Extraction from Research Papers using Conditional Random Fields](https://www.aclweb.org/anthology/N04-1042.pdf). Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004.

- Isaac G. Councill, C. Lee Giles, Min-Yen Kan. [ParsCit: An open-source CRF reference string parsing package](http://www.lrec-conf.org/proceedings/lrec2008/pdf/166_paper.pdf). In Proceedings of the Language Resources and Evaluation Conference (LREC), Marrakech, Morocco, 2008.

-## Other similar Open Source tools
+## Datasets
+
+For end-to-end evaluation:
+
+- [PMC_sample_1943](https://grobid.s3.amazonaws.com/PMC_sample_1943.zip)
+
+- [bioRxiv 10k](https://zenodo.org/record/3873702)
+
+For layout/zoning identification:
+
+- [GROTOAP2](https://repod.icm.edu.pl/dataset.xhtml?persistentId=doi:10.18150/8527338)
+
+- [PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet)
+
+- [DocBank](https://github.com/doc-analysis/DocBank)
+
+## Similar open source tools

- [parsCit](https://github.com/knmnyn/ParsCit)

+- [Neural-ParsCit](https://github.com/WING-NUS/Neural-ParsCit)
+
- [CERMINE](https://github.com/CeON/CERMINE)

- [Science Parse](https://github.com/allenai/science-parse)

- [science Parse v2](https://github.com/allenai/spv2)

-- [Metatagger](https://github.com/iesl/rexa1-metatagger)
-
- [BILBO](https://github.com/OpenEdition/bilbo)

-CiteSeerX page on [Scholarly Information Extraction](http://csxstatic.ist.psu.edu/downloads/software#Services) which lists tools and related information (ok now outdated).
+## Transformer/Layout joint approaches (open source)
+
+- [LayoutLM](https://github.com/microsoft/unilm/tree/master/layoutlm)
+
+- [LayoutLMv2](https://github.com/microsoft/unilm/tree/master/layoutlmv2)
+
+- [VILA](https://github.com/allenai/VILA)
+
+## Other
+
+Created in the context of [PdfPig](https://github.com/UglyToad/PdfPig), the following page is a great collection of resources on Document Layout Analysis: [https://github.com/BobLd/DocumentLayoutAnalysis](https://github.com/BobLd/DocumentLayoutAnalysis/)

diff --git a/doc/img/Screenshot2.png b/doc/img/Screenshot2.png
new file mode 100644
index 0000000000..d3a396bd0d
Binary files /dev/null and b/doc/img/Screenshot2.png differ
diff --git a/doc/img/Screenshot3.png b/doc/img/Screenshot3.png
new file mode 100644
index 0000000000..8ee998f4e1
Binary files /dev/null and b/doc/img/Screenshot3.png differ
diff --git a/doc/img/Screenshot4.png b/doc/img/Screenshot4.png
new file mode 100644
index 0000000000..5b2345666e
Binary files /dev/null and b/doc/img/Screenshot4.png differ
diff --git a/doc/img/Screenshot5.png b/doc/img/Screenshot5.png
new file mode 100644
index 0000000000..c9237cce09
Binary files /dev/null and b/doc/img/Screenshot5.png differ
diff --git a/doc/img/cascade.png b/doc/img/cascade.png
new file mode 100644
index 0000000000..8bc905ada2
Binary files /dev/null and b/doc/img/cascade.png differ
diff --git a/doc/img/ingestion.png b/doc/img/ingestion.png
new file mode 100644
index 0000000000..ef807b64e0
Binary files /dev/null and b/doc/img/ingestion.png differ
diff --git a/doc/index.md b/doc/index.md
index 9d8e7b87d0..999f4a2546 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -5,6 +5,8 @@

* [Introduction](Introduction.md)

+* [How GROBID works](Principles.md)
+
* [References](References.md)

* [License](License.md)

diff --git a/gradle.properties b/gradle.properties
index d4f46622c8..cd36e6011a 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -1,5 +1,4 @@
-#Thu, 21 Apr 2016 18:39:55 +0200
-version=0.7.0-SNAPSHOT
+version=0.7.0
# Set workers to 1 that even for parallel builds it works. (I guess the shadow plugin makes some trouble)
org.gradle.workers.max=1
# from Java 9+
diff --git a/mkdocs.yml b/mkdocs.yml
index 636df6ace3..15cb33d38c 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -12,6 +12,7 @@ pages:
  - Home: 'index.md'
  - About:
    - 'Introduction': 'Introduction.md'
+    - 'How GROBID works': 'Principles.md'
    - 'References': 'References.md'
    - 'Licence': 'License.md'
  - User manual: