diff --git a/CHANGELOG.md b/CHANGELOG.md index 50f31dc010..8d3ceef5e9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,37 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/). +## [0.7.2] – 2022-10-29 + +### Added + ++ Explicit identification of data/code availability statements (#951) and funding statements (#959), including when they are located in the header ++ Link footnote and their "callout" marker in full text (#944) ++ Option to consolidate header only with DOI if a DOI is extracted (#742) ++ "Window" application of RNN model for reference-segmenter to cover long bibliographical sections ++ Add dynamic timeout on pdfalto_server (#926) ++ A modest Python script to help to find "interesting" error cases in a repo of JATS/PDF pairs, grobid-home/scripts/select_error_cases.py + +### Changed + ++ Update to DeLFT version 0.3.2 ++ Some more training data (authors in reference, segmentation, citation, reference-segmenter) (including #961, #864) ++ Update of some models, RNN with feature channels and CRF (segmentation, header, reference-segmenter, citation) ++ Review guidelines for segmentation model ++ Better URL matching, using in particular PDF URL annotation in account + +### Fixed + ++ Fix unexpected figure and table labeling in short texts ++ When matching an ORCID to an author, prioritize Crossref info over extracted ORCID from the PDF (#838) ++ Annotation errors for acknowledgement and other minor stuff ++ Fix for Python library loading for Mac ++ Update docker file to support new CUDA key ++ Do not dehyphenize text in superscript or subscript ++ Allow absolute temporary paths ++ Fix redirected stderr from pdfalto not "gobbled" by the java ProcessBuilder call (#923) ++ Other minor fixes + ## [0.7.1] – 2022-04-16 ### Added diff --git a/Dockerfile.delft b/Dockerfile.delft index 2c012f19c4..3e5b8dd80c 100644 --- a/Dockerfile.delft +++ b/Dockerfile.delft @@ 
-2,14 +2,14 @@ ## See https://grobid.readthedocs.io/en/latest/Grobid-docker/ -## usage example with version 0.7.1-SNAPSHOT: -## docker build -t grobid/grobid:0.7.1-SNAPSHOT --build-arg GROBID_VERSION=0.7.1-SNAPSHOT --file Dockerfile.delft . +## usage example with version 0.7.2-SNAPSHOT: +## docker build -t grobid/grobid:0.7.2-SNAPSHOT --build-arg GROBID_VERSION=0.7.2-SNAPSHOT --file Dockerfile.delft . ## no GPU: -## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.1-SNAPSHOT +## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.2-SNAPSHOT ## allocate all available GPUs (only Linux with proper nvidia driver installed on host machine): -## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/obid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.1-SNAPSHOT +## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/obid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.2-SNAPSHOT # ------------------- # build builder image diff --git a/Readme.md b/Readme.md index 415c3bb84f..1de6078a11 100644 --- a/Readme.md +++ b/Readme.md @@ -25,24 +25,26 @@ The following functionalities are available: - __Header extraction and parsing__ from article in PDF format. The extraction here covers the usual bibliographical information (e.g. title, abstract, authors, affiliations, keywords, etc.). - __References extraction and parsing__ from articles in PDF format, around .87 F1-score against on an independent PubMed Central set of 1943 PDF containing 90,125 references, and around .89 on a similar bioRxiv set of 2000 PDF (using the Deep Learning citation model). 
All the usual publication metadata are covered (including DOI, PMID, etc.). - __Citation contexts recognition and resolution__ of the full bibliographical references of the article. The accuracy of citation contexts resolution is above .78 f-score (which corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference). +- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, etc.). +- __PDF coordinates__ for extracted information, allowing to create "augmented" interactive PDF based on bounding boxes of the identified structures. - Parsing of __references in isolation__ (above .90 F1-score at instance-level, .95 F1-score at field level, using the Deep Learning model). - __Parsing of names__ (e.g. person title, forenames, middlename, etc.), in particular author names in header, and author names in references (two distinct models). - __Parsing of affiliation and address__ blocks. - __Parsing of dates__, ISO normalized day, month, year. -- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference callout, figure, table, etc.). - __Consolidation/resolution of the extracted bibliographical references__ using the [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service or the [CrossRef REST API](https://github.com/CrossRef/rest-api-doc). In both cases, DOI resolution performance is higher than 0.95 F1-score from PDF extraction. - __Extraction and parsing of patent and non-patent references in patent__ publications. -- __PDF coordinates__ for extracted information, allowing to create "augmented" interactive PDF. 
In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middlenames, affiliation types, detailed address, journal, volume, issue, pages, doi, pmid, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure captions, etc.). -GROBID includes a comprehensive web service API, batch processing, a JAVA API, a Docker image, a generic evaluation framework (precision, recall, etc., n-fold cross-evaluation) and the semi-automatic generation of training data. +GROBID includes a comprehensive web service API, batch processing, a JAVA API, Docker images, a generic evaluation framework (precision, recall, etc., n-fold cross-evaluation) and the semi-automatic generation of training data. GROBID can be considered as production ready. Deployments in production includes ResearchGate, Internet Archive Scholar, HAL Research Archive, INIST-CNRS, CERN (Invenio), scite.ai, Academia.edu and many more. The tool is designed for speed and high scalability in order to address the full scientific literature corpus. GROBID should run properly "out of the box" on Linux (64 bits) and macOS. We cannot ensure currently support for Windows as we did before (help welcome!). -GROBID uses optionnally Deep Learning models relying on the [DeLFT](https://github.com/kermitt2/delft) library, a task-agnostic Deep Learning framework for sequence labelling and text classification, via [JEP](https://github.com/ninia/jep). GROBID can run Deep Learning architectures (with or without layout feature channels) or with feature engineered CRF (default), or any mixtures of CRF and DL to balance scalability and accuracy. These models use joint text and visual/layout information provided by [pdfalto](https://github.com/kermitt2/pdfalto). 
+GROBID uses Deep Learning models relying on the [DeLFT](https://github.com/kermitt2/delft) library, a task-agnostic Deep Learning framework for sequence labelling and text classification, via [JEP](https://github.com/ninia/jep). GROBID can run Deep Learning architectures (with or without layout feature channels) or with feature engineered CRF (default), or any mixtures of CRF and DL to balance scalability and accuracy. These models use joint text and visual/layout information provided by [pdfalto](https://github.com/kermitt2/pdfalto). + +Note that by default the Deep Learning models are not used, only CRF are selected in the configuration to accomodate "out of the box" hardware. You need to select the Deep Learning models to be used in the GROBID configuration file, according to your need and hardware capacities (in particular GPU availability and runtime requirements). ## Demo @@ -50,7 +52,7 @@ For testing purposes, a public GROBID demo server is available at the following The Web services are documented [here](https://grobid.readthedocs.io/en/latest/Grobid-service/). -_Warning_: Some quota and query limitation apply to the demo server! Please be courteous and do not overload the demo server. +_Warning_: This demo runs only CRF models. Some quota and query limitation apply to the demo server! Please be courteous and do not overload the demo server. ## Clients diff --git a/doc/Configuration.md b/doc/Configuration.md index 9a33f21b6d..17f65bf8ef 100644 --- a/doc/Configuration.md +++ b/doc/Configuration.md @@ -149,7 +149,7 @@ Under `wapiti`, we find the generic parameters of the Wapiti engine, currently o ### DeLFT global parameters -Under `delft`, we find the generic parameters of the DeLFT engine. For using Deep Learning models, you will need an installation of the python library [DeLFT](https://github.com/kermitt2/delft). 
Use the following parameters to indicate the location of this installation, and optionally the path to the virtual environment folder of this installation: +Under `delft`, we find the generic parameters of the DeLFT engine. For using Deep Learning models, you will need an installation of the python library [DeLFT](https://github.com/kermitt2/delft) or to use the Docker image. For a local build, use the following parameters to indicate the location of this installation, and optionally the path to the virtual environment folder of this installation: ```yml delft: @@ -163,7 +163,7 @@ Under `delft`, we find the generic parameters of the DeLFT engine. For using Dee Each model has its own configuration indicating: -- which "engine" to be used, with values `wapiti` for featured-based CRF or `delft` for Deep Learning models. +- which "engine" to be used, with values `wapiti` for feature-based CRF or `delft` for Deep Learning models. - for Deep Learning models, which neural architecture to be used, with choices normally among `BidLSTM_CRF`, `BidLSTM_CRF_FEATURES`, `BERT`, `BERT-CRF`, `BERT_CRF_FEATURES`. The corresponding model/architecture combination need to be available under `grobid-home/models/`. If it is not the case, you will need to train the model with this particular architecture. diff --git a/doc/Deep-Learning-models.md b/doc/Deep-Learning-models.md index ea7a4d0c15..e452817622 100644 --- a/doc/Deep-Learning-models.md +++ b/doc/Deep-Learning-models.md @@ -2,11 +2,11 @@ ## Integration with DeLFT -Since version `0.5.4` (2018), it is possible to use in GROBID recent Deep Learning sequence labelling models trained with [DeLFT](https://github.com/kermitt2/delft). The available neural models include in particular BidLSTM-CRF with Glove embeddings, with additional feature channel (for layout features), with ELMo, and transformer-based fine-tuned architectures with or without CRF activation layer (e.g. 
SciBERT-CRF), which can be used as alternative to the default Wapiti CRF. +Since GROBID version `0.5.4` (2018), it is possible to use in GROBID recent Deep Learning sequence labelling models trained with [DeLFT](https://github.com/kermitt2/delft). The available neural models include in particular BidLSTM-CRF with Glove embeddings, with additional feature channel (for layout features), with ELMo, and transformer-based fine-tuned architectures with or without CRF activation layer (e.g. SciBERT-CRF), which can be used as alternative to the default Wapiti CRF. -These architectures have been tested on Linux 64bit and macOS. +These architectures have been tested on Linux 64bit and macOS. -Integration is realized via Java Embedded Python [JEP](https://github.com/ninia/jep), which uses a JNI of CPython. This integration is two times faster than the Tensorflow Java API and significantly faster than RPC serving (see [here](https://www.slideshare.net/FlinkForward/flink-forward-berlin-2017-dongwon-kim-predictive-maintenance-with-apache-flink), and it does not require to modify DeLFT as it would be the case with Py4J gateway (socket-based). +Integration is realized via Java Embedded Python [JEP](https://github.com/ninia/jep), which uses a JNI of CPython. This integration is two times faster than the Tensorflow Java API and significantly faster than RPC serving (see [here](https://www.slideshare.net/FlinkForward/flink-forward-berlin-2017-dongwon-kim-predictive-maintenance-with-apache-flink). Additionally, it does not require to modify DeLFT as it would be the case with Py4J gateway (socket-based). There are currently no neural model for the segmentation and the fulltext models, because the input sequences for these models are too large for the current supported Deep Learning architectures. The problem would need to be formulated differently for these tasks or to use alternative DL architectures (with sliding window, etc.). 
@@ -14,15 +14,31 @@ Low-level models not using layout features (author name, dates, affiliations...) See some evaluations under `grobid-trainer/docs`. -Current neural models can be up to 50 time slower than CRF, depending on the architecture and available CPU/GPU. However when sequences can be processed in batch (e.g. for the citation model), overall runtime remains good with clear accuracy gain. This is where the possibility to mix CRF and Deep Learning models for different structuring tasks is very useful, as it permits to adjust the balance between possible accuracy and scalability in a fine-grained manner, using a reasonable amount of memory. +Current neural models can be up to 50 times slower than CRF, depending on the architecture and available CPU/GPU. However when sequences can be processed in batch (e.g. for the citation model), overall runtime remains good with some clear accuracy gain for some models. This is where the possibility to mix CRF and Deep Learning models for different structuring tasks is very useful, as it permits to adjust the balance between possible accuracy and scalability in a fine-grained manner, using a reasonable amount of memory. + +## Recommended Deep Learning models + +By default, only CRF models are used by Grobid. You need to select the Deep Learning models you would like to use in the GROBID configuration yaml file (`grobid/grobid-home/config/grobid.yaml`). See [here](https://grobid.readthedocs.io/en/latest/Configuration/#configuring-the-models) for more details on how to select these models. The most convenient way to use the Deep Learning models is to use the full GROBID Docker image and pass a configuration file at launch of the container describing the selected models to be used instead of the default CRF ones. 
+ +For current GROBID version 0.7.2, we recommend considering the usage of the following Deep Learning models: + +- `citation` model: for bibliographical parsing, the `BidLSTM_CRF_FEATURES` architecture provides currently the best accuracy, significantly better than CRF. With a GPU, there is normally no runtime impact by selecting this model. + +- `affiliation-address` model: for parsing affiliation and address blocks, `BidLSTM_CRF_FEATURES` architecture provides better accuracy than CRF at the cost of a minor runtime impact. + +- `reference-segmenter` model: this model segments a bibliographical reference section into individual references, `BidLSTM_CRF_FEATURES` architecture provides better accuracy than CRF (even on very very very long reference sections), but at the cost of a global runtime 2 to 3 times slower. + +Other Deep Learning models do not show better accuracy than old-school CRF according to our benchmarkings, so we do not recommend using them in general at this stage. However, some of them tend to be more portable and can be more reliable than CRF for document layouts and scientific domains far from what is available in the training data. + +Finally, the models `segmentation` (overall first-pass segmentation of a document in general zones) and `fulltext` (structuring the content body of a document) are currently only based on CRF, due to the long input sequences to be processed. ### Getting started with Deep Learning -Using Deep Learning model in GROBID with a normal installation/build is not straightforward at the present time, due to the required availability of various native libraries and to the Python dynamic linking and packaging mess, which leads to force some strict version and system dependencies. Interfacing natively to a particular Python virtual environment (which is "sesssion-based") is challenging. We are exploring different approach to facilitate this and get a "out-of-the-out" working system. 
+Using Deep Learning model in GROBID with a normal installation/build is not straightforward at the present time, due to the required availability of various native libraries and to the Python dynamic linking and packaging mess, which leads to force some strict version and system dependencies. Interfacing natively to a particular Python virtual environment (which is "session-based") is challenging. We are exploring different approach to facilitate this and get a "out-of-the-out" working system. The most simple solution is to use the ["full" GROBID docker image](Grobid-docker.md), which allows to use Deep Learning models without further installation and which provides automatic GPU support. -However if you need a "local" library installation and build, here are the step-by-step instructions to get a working Deep Learning GROBID. +However if you need a "local" library installation and build, prepare a lot of coffee, here are the step-by-step instructions to get a working local Deep Learning GROBID. #### Classic python and Virtualenv @@ -31,7 +47,7 @@ However if you need a "local" library installation and build, here are the step- You __must__ use a Java version under or equals to Java 11. At the present time, JVM 1.12 to 1.17 will fail to load the native JEP library (due to additional security constraints). 1. install [DeLFT](https://github.com/kermitt2/delft), see instructions [here](https://github.com/kermitt2/delft#install). -DeLFT version `0.3.1` has been tested successfully with Python 3.7 and 3.8. For GPU support, CUDA >=11.2 must be installed. +DeLFT version `0.3.2` has been tested successfully with Python 3.7 and 3.8. For GPU support, CUDA >=11.2 must be installed. 2. 
Test your DeLFT installation for GROBID models: @@ -110,8 +126,7 @@ INFO [2020-10-30 23:04:07,756] org.grobid.core.jni.DeLFTModel: Loading DeLFT mo INFO [2020-10-30 23:04:07,758] org.grobid.core.jni.JEPThreadPool: Creating JEP instance for thread 44 ``` -It is then possible to [benchmark end-to-end](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) the selected Deep Learning models as any usual GROBID benchmarking exercise. In practice the CRF models should be mixed with Deep Learning models to keep the process reasonably fast and memory-hungry. In addition, note that, currently, due to the limited amount of training data, Deep Learning models perform significantly better than CRF only for two models (`citation` and `affiliation-address`), so there is likely no practical interest to use Deep Learning for the other models. This will of course certainly change in the future! - +It is then possible to [benchmark end-to-end](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) the selected Deep Learning models as any usual GROBID benchmarking exercise. In practice, the CRF models should be mixed with Deep Learning models to keep the process reasonably fast and memory-hungry. In addition, note that, currently, due to the limited amount of training data, Deep Learning models perform significantly better than CRF only for a few models (`citation`, `affiliation-address`, `reference-segmenter`). This should of course certainly change in the future! #### Anaconda diff --git a/doc/Frequently-asked-questions.md b/doc/Frequently-asked-questions.md index 58fc2a515f..2fe2898dc9 100644 --- a/doc/Frequently-asked-questions.md +++ b/doc/Frequently-asked-questions.md @@ -30,7 +30,7 @@ The exact server configuration will depend on the service you want to call. We p You will get the embedded images converted into `.png` by using the normal batch command. 
For instance: ```console -java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.1-SNAPSHOT-onejar.jar -gH grobid-home -dIn ~/test/in/ -dOut ~/test/out -exe processFullText +java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -dIn ~/test/in/ -dOut ~/test/out -exe processFullText ``` There is a web service doing the same, returning everything in a big zip file, `processFulltextAssetDocument`, still usable but deprecated. diff --git a/doc/Grobid-batch.md b/doc/Grobid-batch.md index f67e528fbe..e9c49149f5 100644 --- a/doc/Grobid-batch.md +++ b/doc/Grobid-batch.md @@ -18,7 +18,7 @@ The following command display some help for the batch commands: Be sure to replace `` with the current version of GROBID that you have installed and built. For example: ```bash -> java -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -h +> java -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -h ``` The available batch commands are listed bellow. For those commands, at least `-Xmx1G` is used to set the JVM memory to avoid *OutOfMemoryException* given the current size of the Grobid models and the crazyness of some PDF. For complete fulltext processing, which involve all the GROBID models, `-Xmx4G` is recommended (although allocating less memory is usually fine). 
@@ -40,7 +40,7 @@ The needed parameters for that command are: Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader ``` WARNING: the expected extension of the PDF files to be processed is .pdf @@ -64,7 +64,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf Example: ```bash -> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText +> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText ``` WARNING: the expected extension of the PDF files to be processed is .pdf @@ -78,7 +78,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format" +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format" ``` ### processAuthorsHeader @@ -90,7 +90,7 @@ Example: Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors" +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors" ``` ### processAuthorsCitation @@ -102,7 +102,7 @@ Example: Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors" +> java -Xmx1G -jar 
grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors" ``` ### processAffiliation @@ -114,7 +114,7 @@ Example: Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation" +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation" ``` ### processRawReference @@ -126,7 +126,7 @@ Example: Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -exe processRawReference -s "a reference string" +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -exe processRawReference -s "a reference string" ``` ### processReferences @@ -142,7 +142,7 @@ Example: Example: ```bash -> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences +> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences ``` WARNING: the expected extension of the PDF files to be processed is .pdf @@ -158,7 +158,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf Example: ```bash -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentST36 +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentST36 ``` WARNING: extension of the ST.36 files to be processed must be .xml @@ -174,7 +174,7 @@ WARNING: extension of the ST.36 files to be processed must be .xml Example: ``` -> java -Xmx1G -jar 
grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT ``` WARNING: extension of the text files to be processed must be .txt, and expected encoding is UTF-8 @@ -190,7 +190,7 @@ WARNING: extension of the text files to be processed must be .txt, and expected Example: ``` -> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF +> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF ``` WARNING: extension of the text files to be processed must be .pdf @@ -206,7 +206,7 @@ WARNING: extension of the text files to be processed must be .pdf Example: ```bash -> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining +> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining ``` WARNING: the expected extension of the PDF files to be processed is .pdf @@ -222,7 +222,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf Example: ```bash -> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank +> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank ``` WARNING: the expected 
extension of the PDF files to be processed is .pdf @@ -240,7 +240,7 @@ The needed parameters for that command are: Example: ``` -> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation +> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation ``` WARNING: extension of the text files to be processed must be .pdf diff --git a/doc/Grobid-docker.md b/doc/Grobid-docker.md index 4cc064d122..89526a48d0 100644 --- a/doc/Grobid-docker.md +++ b/doc/Grobid-docker.md @@ -26,7 +26,7 @@ The process for retrieving and running the image is as follow: Current latest version: ```bash -> docker pull grobid/grobid:0.7.1 +> docker pull grobid/grobid:0.7.2 ``` - Run the container: @@ -113,7 +113,7 @@ Grobid web services are then available as described in the [service documentatio The simplest way to pass a modified configuration to the docker image is to mount the yaml GROBID config file `grobid.yaml` when running the image. Modify the config file `grobid/grobid-home/config/grobid.yaml` according to your requirements on the host machine and mount it when running the image as follow: ```bash -docker run --rm --gpus all -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.7.2-SNAPSHOT +docker run --rm --gpus all -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.7.3-SNAPSHOT ``` You need to use an absolute path to specify your modified `grobid.yaml` file. @@ -200,25 +200,25 @@ Without this requirement, the image might default to CPU, even if GPU are availa For being able to use both CRF and Deep Learningmodels, use the dockerfile `./Dockerfile.delft`. 
The only important information then is the version which will be checked out from the tags. ```bash -> docker build -t grobid/grobid:0.7.1 --build-arg GROBID_VERSION=0.7.1 --file Dockerfile.delft . +> docker build -t grobid/grobid:0.7.2 --build-arg GROBID_VERSION=0.7.2 --file Dockerfile.delft . ``` Similarly, if you want to create a docker image from the current master, development version: ```bash -docker build -t grobid/grobid:0.7.2-SNAPSHOT --build-arg GROBID_VERSION=0.7.2-SNAPSHOT --file Dockerfile.delft . +docker build -t grobid/grobid:0.7.3-SNAPSHOT --build-arg GROBID_VERSION=0.7.3-SNAPSHOT --file Dockerfile.delft . ``` -In order to run the container of the newly created image, for example for the development version `0.7.2-SNAPSHOT`, using all GPU available: +In order to run the container of the newly created image, for example for the development version `0.7.3-SNAPSHOT`, using all GPU available: ```bash -> docker run --rm --gpus all -p 8080:8070 -p 8081:8071 grobid/grobid:0.7.2-SNAPSHOT +> docker run --rm --gpus all -p 8080:8070 -p 8081:8071 grobid/grobid:0.7.3-SNAPSHOT ``` In practice, you need to indicate which models should use a Deep Learning model implementation and which ones can remain with a faster CRF model implementation, which is done currently in the `grobid.yaml` file. Modify the config file `grobid/grobid-home/config/grobid.yaml` accordingly on the host machine and mount it when running the image as follow: ```bash -docker run --rm --gpus all -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.7.2-SNAPSHOT +docker run --rm --gpus all -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.7.3-SNAPSHOT ``` You need to use an absolute path to specify your modified `grobid.yaml` file. 
@@ -240,19 +240,19 @@ The container name is given by the command: For building a CRF-only image, the dockerfile to be used is `./Dockerfile.crf`. The only important information then is the version which will be checked out from the tags. ```bash -> docker build -t grobid/grobid:0.7.1 --build-arg GROBID_VERSION=0.7.1 --file Dockerfile.crf . +> docker build -t grobid/grobid:0.7.2 --build-arg GROBID_VERSION=0.7.2 --file Dockerfile.crf . ``` Similarly, if you want to create a docker image from the current master, development version: ```bash -> docker build -t grobid/grobid:0.7.2-SNAPSHOT --build-arg GROBID_VERSION=0.7.2-SNAPSHOT --file Dockerfile.crf . +> docker build -t grobid/grobid:0.7.3-SNAPSHOT --build-arg GROBID_VERSION=0.7.3-SNAPSHOT --file Dockerfile.crf . ``` -In order to run the container of the newly created image, for example for version `0.7.1`: +In order to run the container of the newly created image, for example for version `0.7.2`: ```bash -> docker run -t --rm -p 8080:8070 -p 8081:8071 grobid/grobid:0.7.1 +> docker run -t --rm -p 8080:8070 -p 8081:8071 grobid/grobid:0.7.2 ``` For testing or debugging purposes, you can connect to the container with a bash shell (logs are under `/opt/grobid/logs/`): diff --git a/doc/Grobid-java-library.md b/doc/Grobid-java-library.md index a831ac34de..9b110d7c3e 100644 --- a/doc/Grobid-java-library.md +++ b/doc/Grobid-java-library.md @@ -9,7 +9,7 @@ The second option is of course to build yourself Grobid and to use the generated ## Using maven -The Java artefacts of the latest GROBID release (0.7.1) are uploaded on a DIY repository. +The Java artefacts of the latest GROBID release (0.7.2) are uploaded on a DIY repository. 
You need to add the following snippet in your `pom.xml` in order to configure it: @@ -29,19 +29,19 @@ Here an example of `grobid-core` dependency: <dependency> <groupId>org.grobid</groupId> <artifactId>grobid-core</artifactId> - <version>0.7.1</version> + <version>0.7.2</version> </dependency> ``` -If you want to work on a SNAPSHOT development version, you need to download and build the current master yourself, and include in your pom file the path to the local snapshot Grobid jar file, for instance as follow (if necessary replace `0.7.2-SNAPSHOT` by the valid ``): +If you want to work on a SNAPSHOT development version, you need to download and build the current master yourself, and include in your pom file the path to the local snapshot Grobid jar file, for instance as follow (if necessary replace `0.7.3-SNAPSHOT` by the valid ``): ```xml <dependency> <groupId>org.grobid</groupId> <artifactId>grobid-core</artifactId> - <version>0.7.2-SNAPSHOT</version> + <version>0.7.3-SNAPSHOT</version> <scope>system</scope> - <systemPath>${project.basedir}/lib/grobid-core-0.7.2-SNAPSHOT.jar</systemPath> + <systemPath>${project.basedir}/lib/grobid-core-0.7.3-SNAPSHOT.jar</systemPath> </dependency> ``` @@ -59,8 +59,8 @@ Add the following snippet in your gradle.build file: and add the Grobid dependency as well: ``` - compile 'org.grobid:grobid-core:0.7.1' - compile 'org.grobid:grobid-trainer:0.7.1' + compile 'org.grobid:grobid-core:0.7.2' + compile 'org.grobid:grobid-trainer:0.7.2' ``` ## API call diff --git a/doc/Grobid-service.md b/doc/Grobid-service.md index 61bb8a1e41..e407294176 100644 --- a/doc/Grobid-service.md +++ b/doc/Grobid-service.md @@ -23,9 +23,9 @@ You could also build and install the service as a standalone service (let's supp cd ..
mkdir grobid-installation cd grobid-installation -unzip ../grobid/grobid-service/build/distributions/grobid-service-0.7.1.zip -mv grobid-service-0.7.1 grobid-service -unzip ../grobid/grobid-home/build/distributions/grobid-home-0.7.1.zip +unzip ../grobid/grobid-service/build/distributions/grobid-service-0.7.2.zip +mv grobid-service-0.7.2 grobid-service +unzip ../grobid/grobid-home/build/distributions/grobid-home-0.7.2.zip ./grobid-service/bin/grobid-service ``` diff --git a/doc/Install-Grobid.md b/doc/Install-Grobid.md index de1a479f29..0fa582005e 100644 --- a/doc/Install-Grobid.md +++ b/doc/Install-Grobid.md @@ -6,17 +6,17 @@ GROBID requires a JVM installed on your machine, supported version is **JVM 8**. ### Latest stable release -The [latest stable release](https://github.com/kermitt2/grobid#latest-version) of GROBID is version ```0.7.1``` which can be downloaded as follow: +The [latest stable release](https://github.com/kermitt2/grobid#latest-version) of GROBID is version ```0.7.2``` which can be downloaded as follow: ```bash -> wget https://github.com/kermitt2/grobid/archive/0.7.1.zip -> unzip 0.7.1.zip +> wget https://github.com/kermitt2/grobid/archive/0.7.2.zip +> unzip 0.7.2.zip ``` or using the [docker](Grobid-docker.md) container. 
### Current development version -The current development version is ```0.7.2-SNAPSHOT```, which can be downloaded from GitHub and built as follow: +The current development version is ```0.7.3-SNAPSHOT```, which can be downloaded from GitHub and built as follow: Clone source code from github: ```bash diff --git a/doc/Notes-grobid-developers.md b/doc/Notes-grobid-developers.md index 1abcd98f98..35ee6f3c0c 100644 --- a/doc/Notes-grobid-developers.md +++ b/doc/Notes-grobid-developers.md @@ -9,11 +9,11 @@ The idea anyway is that people will use Grobid with the Docker image, the servic In order to make a new release: -+ tag the project branch to be releases, for instance a version `0.7.1`: ++ tag the project branch to be releases, for instance a version `0.7.2`: ``` -> git tag 0.7.1 -> git push origin 0.7.1 +> git tag 0.7.2 +> git push origin 0.7.2 ``` + create a github release: the easiest is to use the GitHub web interface @@ -35,7 +35,7 @@ In order to make a new release: ``` dependencies { - implementation 'org.grobid:grobid-core:0.7.1' + implementation 'org.grobid:grobid-core:0.7.2' } ``` @@ -55,7 +55,7 @@ for maven projects: <dependency> <groupId>org.grobid</groupId> <artifactId>grobid-core</artifactId> - <version>0.7.1</version> + <version>0.7.2</version> </dependency> ``` diff --git a/gradle.properties b/gradle.properties index f817e1f446..c14edd65ec 100644 --- a/gradle.properties +++ b/gradle.properties @@ -1,4 +1,4 @@ -version=0.7.2-SNAPSHOT +version=0.7.2 # Set workers to 1 that even for parallel builds it works. (I guess the shadow plugin makes some trouble) org.gradle.workers.max=1 org.gradle.caching = true