Fix Elsevier onion's ring API wrapping #18

Draft: wants to merge 32 commits into base: master

Commits
e8403a0
Avoid adding spaces introduced before a punctuation, adding tests
lfoppiano May 8, 2024
88bf63b
add build
lfoppiano May 8, 2024
26b7671
add build
lfoppiano May 8, 2024
911669f
more verbosity in case of failures
lfoppiano May 8, 2024
7d29e82
add setting file
lfoppiano May 8, 2024
a9a2061
Fix java version
lfoppiano May 8, 2024
42c470e
Fix java version
lfoppiano May 8, 2024
5a88123
Fix java version
lfoppiano May 8, 2024
9ba34c4
Update to dropwizard 4
lfoppiano May 8, 2024
e3772fd
Fix build
lfoppiano May 8, 2024
5147c20
Merge branch 'master' into update-dropwizard
lfoppiano May 8, 2024
7d1e62f
typo
lfoppiano May 9, 2024
be08be7
Merge branch 'refs/heads/bugfix/sentence-segmentation'
lfoppiano May 9, 2024
b76e607
fix imports
lfoppiano May 9, 2024
a378553
add missing dependencies
lfoppiano May 9, 2024
5f400b6
update config for dropwizard 4
lfoppiano May 9, 2024
4219459
Fix repository name
lfoppiano May 9, 2024
21b5901
remove afterburner
lfoppiano May 9, 2024
c66979c
update dependencies
lfoppiano May 10, 2024
93136b6
remove afterburner
lfoppiano May 9, 2024
a1fea77
Merge branch 'update-dropwizard'
lfoppiano May 10, 2024
2729a7c
Update ci-build-manual.yml
lfoppiano May 10, 2024
2a7bee1
update tests
lfoppiano May 13, 2024
a393d17
add root path
lfoppiano May 13, 2024
2a642f5
Revert "add root path"
lfoppiano May 13, 2024
a480325
add a generateIDs parameter and process
lfoppiano May 28, 2024
8ed9844
Merge pull request #1 from lfoppiano/add-generated-id
lfoppiano May 28, 2024
0ec5db5
add documentation
lfoppiano Jun 4, 2024
b7a89b4
update client
lfoppiano Jun 4, 2024
7ef6cda
Merge branch 'refs/heads/add-generated-id'
lfoppiano Jun 4, 2024
7968d80
Added a mechanism for Elsevier API content
laurentromary Sep 6, 2024
97d5d8b
allow suffix to the manual docker build
lfoppiano Sep 6, 2024
53 changes: 53 additions & 0 deletions .github/workflows/ci-build-manual.yml
@@ -0,0 +1,53 @@
name: Build and push a development version on docker

on:
workflow_dispatch:
inputs:
suffix:
type: string
description: Docker image suffix (e.g. dev, prod, feature1)
required: false


jobs:
build:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
- name: Set up JDK 11
uses: actions/setup-java@v4
with:
java-version: '11'
distribution: 'temurin'
cache: 'gradle'
- name: Build with Gradle
run: ./gradlew build -x test

docker-build:
needs: [ build ]
runs-on: ubuntu-latest

steps:
- name: Create more disk space
run: |
sudo rm -rf /usr/share/dotnet
sudo rm -rf /opt/ghc
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /opt/hostedtoolcache
- uses: actions/checkout@v4
- name: Build and push
id: docker_build
uses: mr-smithers-excellent/docker-build-push@v6
with:
dockerfile: Dockerfile
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
image: lfoppiano/pub2tei
registry: docker.io
pushImage: true
tags: |
latest-develop${{ github.event.inputs.suffix != '' && '-' || '' }}${{ github.event.inputs.suffix }}
- name: Image digest
run: echo ${{ steps.docker_build.outputs.digest }}
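
The `tags` expression above appends `-<suffix>` to `latest-develop` only when a suffix is supplied on manual dispatch. A minimal Python sketch of the same logic, assuming an omitted input arrives as an empty string; the function name `docker_tag` is hypothetical and not part of the workflow:

```python
def docker_tag(suffix: str = "") -> str:
    """Mirror the workflow tag expression.

    The separator '-' is emitted only when the suffix is non-empty, which is
    what `suffix != '' && '-' || ''` evaluates to in the workflow.
    """
    separator = "-" if suffix != "" else ""
    return f"latest-develop{separator}{suffix}"


# Illustrative values taken from the input description (dev, feature1).
for value in ("", "dev", "feature1"):
    print(repr(value), "->", docker_tag(value))
```

With no suffix the image is pushed as `latest-develop`; with `dev` it becomes `latest-develop-dev`.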
27 changes: 27 additions & 0 deletions .github/workflows/ci-build.yml
@@ -0,0 +1,27 @@
name: Build unstable

on: [push]

jobs:
build:
runs-on: ubuntu-latest

steps:
- name: Checkout grobid home
uses: actions/checkout@v4
with:
repository: kermitt2/grobid
path: ./grobid
- name: Checkout Pub2TEI
uses: actions/checkout@v4
with:
path: ./grobid/Pub2TEI
- name: Set up JDK 11
uses: actions/setup-java@v4
with:
java-version: '11'
distribution: 'temurin'
cache: 'gradle'
- name: Build and run integration tests
working-directory: ./grobid/Pub2TEI
run: ./gradlew test --stacktrace --info
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,7 +1,7 @@
## Docker Pub2TEI image using Grobid deep learning models and/or CRF models for transformation enhancements

# this is the full GROBID image using NVIDIA Container Toolkit to automatically recognize possible GPU drivers on the host machine
FROM grobid/grobid:0.8.0
FROM lfoppiano/grobid:0.8.0-full-slim

# Add Tini
ENV TINI_VERSION v0.19.0
23 changes: 11 additions & 12 deletions Readme.md
@@ -72,9 +72,8 @@ git clone https://github.com/kermitt2/Pub2TEI
cd client
python3 pub2tei_client.py --help

usage: pub2tei_client.py [-h] [--input INPUT] [--output OUTPUT] [--config CONFIG] [--n N]
[--consolidate_references] [--segment_sentences] [--grobid_refine] [--force]
[--verbose]
usage: pub2tei_client.py [-h] --input INPUT [--output OUTPUT] [--config CONFIG] [--n N] [--consolidate_references] [--segment_sentences]
[--generate_ids] [--grobid_refine] [--force] [--verbose]

Client for Pub2TEI services

@@ -86,10 +85,9 @@ optional arguments:
--n N concurrency for service usage
--consolidate_references
use GROBID for consolidation of the bibliographical references
--segment_sentences segment sentences in the text content of the document with additional <s>
elements
--grobid_refine use Grobid to structure/enhance raw fields: affiliations, references, person,
dates
--segment_sentences segment sentences in the text content of the document with additional <s> elements
--generate_ids generate an identifier for each text item
--grobid_refine use Grobid to structure/enhance raw fields: affiliations, references, person, dates
--force force re-processing pdf input files when tei output files already exist
--verbose print information about processed files in the console
```
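
The updated usage text makes `--input` mandatory and adds a `--generate_ids` switch. The sketch below rebuilds an equivalent `argparse` parser from the usage text alone. It is not the client's actual source: the function name `build_parser`, the absence of defaults, and the help strings for `--input`, `--output`, and `--config` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser reconstructed from the usage text in the README."""
    parser = argparse.ArgumentParser(description="Client for Pub2TEI services")
    # --input appears without brackets in the new usage line, so it is required.
    parser.add_argument("--input", required=True, help="input path (assumed wording)")
    parser.add_argument("--output", help="output path (assumed wording)")
    parser.add_argument("--config", help="configuration file (assumed wording)")
    parser.add_argument("--n", type=int, help="concurrency for service usage")
    # All remaining options are boolean switches that default to False.
    for flag in ("consolidate_references", "segment_sentences", "generate_ids",
                 "grobid_refine", "force", "verbose"):
        parser.add_argument(f"--{flag}", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the demo isolated.
args = build_parser().parse_args(["--input", "publisher_xml", "--generate_ids"])
print(args.input, args.generate_ids, args.segment_sentences)
```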
@@ -112,12 +110,13 @@ Note that the consolidation is realized with the consolidation service indicated

Transform a publisher XML into TEI XML format, with optional enhancements.

| method | request type | response type | parameters | requirement | description |
|--- |--- |--- |--- |--- |--- |
| POST | `multipart/form-data` | `application/xml` | `input` | required | publisher XML file to be processed |
| | | | `segmentSentences` | optional | Boolean, if true the paragraphs structures in the resulting TEI will be further segmented into sentence elements <s> |
| | | | `grobidRefine` | optional | Boolean, if true the raw affiliations and raw biblographical reference strings will be parsed with Grobid and the resulting structured information added in the transformed TEI XML |
| method | request type | response type | parameters | requirement | description |
|--- |--- |--- |-------------------------|--- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| POST | `multipart/form-data` | `application/xml` | `input` | required | publisher XML file to be processed |
| | | | `segmentSentences` | optional | Boolean; if true, the paragraph structures in the resulting TEI are further segmented into sentence elements `<s>` |
| | | | `grobidRefine` | optional | Boolean; if true, the raw affiliations and raw bibliographical reference strings are parsed with Grobid and the resulting structured information is added to the transformed TEI XML |
| | | | `consolidateReferences` | optional | Consolidate all the bibliographical references; `consolidateReferences` is a string with value `0` (no consolidation, default), `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject the DOI only). |
| | | | `generateIDs` | optional | Inject the attribute `xml:id` in the textual elements (`title`, `note`, `term`, `keywords`, `p`, `s`) |
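
The optional parameters in the table can be assembled into form fields before the multipart request is sent. The sketch below only builds the field dictionary. The helper name `build_form_fields` is hypothetical, and encoding booleans as the strings `true`/`false` is an assumption: the table says "Boolean" without giving the wire format.

```python
from typing import Dict


def build_form_fields(segment_sentences: bool = False,
                      grobid_refine: bool = False,
                      consolidate_references: int = 0,
                      generate_ids: bool = False) -> Dict[str, str]:
    """Build the optional form fields listed in the parameter table."""
    # The table allows only 0, 1 or 2 for consolidateReferences.
    if consolidate_references not in (0, 1, 2):
        raise ValueError("consolidateReferences must be 0, 1 or 2")

    def as_flag(value: bool) -> str:
        # Assumed boolean encoding; adjust to whatever the service expects.
        return "true" if value else "false"

    return {
        "segmentSentences": as_flag(segment_sentences),
        "grobidRefine": as_flag(grobid_refine),
        "consolidateReferences": str(consolidate_references),
        "generateIDs": as_flag(generate_ids),
    }


print(build_form_fields(segment_sentences=True, generate_ids=True))
```

The publisher XML file itself goes in the required `input` part of the `multipart/form-data` body, alongside these fields.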

Response status codes:

7 changes: 6 additions & 1 deletion Stylesheets/Elsevier.xsl
@@ -3,7 +3,8 @@
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.tei-c.org/ns/1.0"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:ce="http://www.elsevier.com/xml/common/dtd"
xmlns:ce="http://www.elsevier.com/xml/common/dtd"
xmlns:svapi="http://www.elsevier.com/xml/svapi/article/dtd"
xmlns:mml="http://www.w3.org/1998/Math/MathML"
xmlns:els1="http://www.elsevier.com/xml/ja/dtd"
xmlns:els2="http://www.elsevier.com/xml/cja/dtd"
@@ -2428,6 +2429,10 @@
</xsl:otherwise>
</xsl:choose>
</xsl:variable>

<xsl:template match="svapi:full-text-retrieval-response">
<xsl:apply-templates select="descendant::els1:article"/>
</xsl:template>
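
The new template above strips the Elsevier API envelope: when the root is `svapi:full-text-retrieval-response`, processing continues on every descendant `els1:article`. A rough Python equivalent using only the standard library follows. The function name `unwrap_articles` and the sample elements `svapi:metadata` and `svapi:payload` are made up for illustration and do not describe the real API response layout.

```python
import xml.etree.ElementTree as ET

NS = {
    "svapi": "http://www.elsevier.com/xml/svapi/article/dtd",
    "els1": "http://www.elsevier.com/xml/ja/dtd",
}

# Hypothetical envelope: only the root element and els1:article come from the diff.
SAMPLE = """\
<svapi:full-text-retrieval-response
    xmlns:svapi="http://www.elsevier.com/xml/svapi/article/dtd"
    xmlns:els1="http://www.elsevier.com/xml/ja/dtd">
  <svapi:metadata/>
  <svapi:payload>
    <els1:article><els1:item-info/></els1:article>
  </svapi:payload>
</svapi:full-text-retrieval-response>
"""


def unwrap_articles(root: ET.Element) -> list:
    """Return all descendant els1:article elements of an API envelope."""
    wrapper_tag = f"{{{NS['svapi']}}}full-text-retrieval-response"
    if root.tag != wrapper_tag:
        return []  # not an API envelope; nothing to unwrap
    # './/' walks all descendants, like descendant::els1:article in the XSLT.
    return root.findall(".//els1:article", NS)


articles = unwrap_articles(ET.fromstring(SAMPLE))
print(len(articles), articles[0].tag)
```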

<xsl:template match="els1:article[els1:item-info] |els2:article[els2:item-info] | els1:converted-article[els1:item-info] | els2:converted-article[els2:item-info] | converted-article[item-info] | article[item-info]">
<!--xsl:comment>
14 changes: 14 additions & 0 deletions Stylesheets/FullTextTags.xsl
@@ -850,6 +850,20 @@
</quote>
</xsl:template>

<!-- Elsevier displayed-quote/ce:simple-para -->

<xsl:template match="ce:displayed-quote">
<cit>
<xsl:apply-templates/>
</cit>
</xsl:template>

<xsl:template match="ce:displayed-quote/ce:simple-para">
<quote>
<xsl:apply-templates/>
</quote>
</xsl:template>
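
The two templates above map `ce:displayed-quote` to a TEI `cit` and each nested `ce:simple-para` to a `quote`. The Python sketch below reproduces that mapping with one simplification: inline markup inside the paragraph is flattened to plain text, whereas the stylesheet applies templates recursively. The function name `convert_displayed_quote` is hypothetical.

```python
import xml.etree.ElementTree as ET

CE = "http://www.elsevier.com/xml/common/dtd"
TEI = "http://www.tei-c.org/ns/1.0"


def convert_displayed_quote(displayed_quote: ET.Element) -> ET.Element:
    """Map ce:displayed-quote -> tei:cit and ce:simple-para -> tei:quote."""
    cit = ET.Element(f"{{{TEI}}}cit")
    for child in displayed_quote:
        if child.tag == f"{{{CE}}}simple-para":
            quote = ET.SubElement(cit, f"{{{TEI}}}quote")
            # Simplification: keep only the text, dropping inline elements.
            quote.text = "".join(child.itertext())
    return cit


source = ET.fromstring(
    f'<ce:displayed-quote xmlns:ce="{CE}">'
    "<ce:simple-para>To be, or not to be.</ce:simple-para>"
    "</ce:displayed-quote>"
)
print(ET.tostring(convert_displayed_quote(source), encoding="unicode"))
```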

<!-- Formatting elements that we discard -->

<xsl:template match="ce:vsp"/>
4 changes: 4 additions & 0 deletions Stylesheets/Publishers.xsl
@@ -4,6 +4,7 @@
xmlns="http://www.tei-c.org/ns/1.0"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:ce="http://www.elsevier.com/xml/common/dtd"
xmlns:svapi="http://www.elsevier.com/xml/svapi/article/dtd"
xmlns:mml="http://www.w3.org/1998/Math/MathML"
xmlns:els1="http://www.elsevier.com/xml/ja/dtd"
xmlns:els2="http://www.elsevier.com/xml/cja/dtd"
@@ -55,6 +56,9 @@
<xsl:when test="els1:article[els1:item-info] | els2:article[els2:item-info] | els1:converted-article[els1:item-info] | els2:converted-article[els2:item-info] | converted-article[item-info]">
<xsl:message>Converting an Elsevier article</xsl:message>
</xsl:when>
<xsl:when test="svapi:full-text-retrieval-response">
<xsl:message>Converting an Elsevier article obtained from the Elsevier API</xsl:message>
</xsl:when>
<xsl:when test="nihms-submit">
<xsl:message>Converting a Nature article</xsl:message>
</xsl:when>
65 changes: 34 additions & 31 deletions build.gradle
@@ -19,7 +19,10 @@ repositories {
maven {
url new File(rootProject.rootDir, "localLibs")
}
maven { url "https://grobid.s3.eu-west-1.amazonaws.com/repo/" }
flatDir {
dirs 'localLibs'
}
// maven { url "https://grobid.s3.eu-west-1.amazonaws.com/repo/" }
}

apply plugin: 'application'
@@ -39,8 +42,8 @@ version '0.2'

description = """transform the myriad of scientific publisher XML into the same TEI XML format, common to GROBID"""

sourceCompatibility = 1.8
targetCompatibility = 1.8
sourceCompatibility = 1.11
targetCompatibility = 1.11

tasks.withType(JavaCompile) {
options.encoding = 'UTF-8'
@@ -77,8 +80,10 @@ dependencies {
testImplementation group: 'junit', name: 'junit', version: '4.12'
testImplementation "org.hamcrest:hamcrest-all:1.3"
testImplementation "org.easymock:easymock:3.5"
testImplementation "org.xmlunit:xmlunit-matchers:2.10.0"
testImplementation "org.xmlunit:xmlunit-legacy:2.10.0"

// packaging local libs
// packaging local libs
implementation fileTree(dir: new File(rootProject.rootDir, 'localLibs'), include: localLibs)

implementation(group: 'xml-apis', name: 'xml-apis') {
@@ -96,28 +101,39 @@ dependencies {
implementation group: 'org.grobid', name: 'grobid-core', version: '0.8.0'
implementation "black.ninia:jep:4.0.2"

implementation "io.dropwizard:dropwizard-core:1.3.29"
implementation "io.dropwizard:dropwizard-assets:1.3.29"
implementation "com.hubspot.dropwizard:dropwizard-guicier:1.3.5.2"
implementation "io.dropwizard:dropwizard-testing:1.3.29"
implementation "io.dropwizard:dropwizard-forms:1.3.29"
implementation "io.dropwizard:dropwizard-client:1.3.29"
implementation "io.dropwizard:dropwizard-auth:1.3.29"
implementation "io.dropwizard.metrics:metrics-core:4.0.5"
implementation "io.dropwizard.metrics:metrics-servlets:4.0.5"
implementation 'ru.vyarus:dropwizard-guicey:7.0.0'
implementation 'io.dropwizard:dropwizard-bom:4.0.0'
implementation 'io.dropwizard:dropwizard-core:4.0.0'
implementation 'io.dropwizard:dropwizard-assets:4.0.0'
implementation 'io.dropwizard:dropwizard-testing:4.0.0'
implementation 'io.dropwizard:dropwizard-forms:4.0.0'
implementation 'io.dropwizard:dropwizard-client:4.0.0'
implementation 'io.dropwizard:dropwizard-auth:4.0.0'
implementation 'io.dropwizard.metrics:metrics-core:4.2.22'
implementation 'io.dropwizard.metrics:metrics-servlets:4.2.22'

implementation "xerces:xercesImpl:2.12.0"
implementation "net.arnx:jsonic:1.3.10"
implementation "net.sf.saxon:Saxon-HE:9.6.0-9"
implementation "xom:xom:1.3.2"
implementation 'javax.xml.bind:jaxb-api:2.3.0'
implementation 'org.apache.opennlp:opennlp-tools:1.9.1'
implementation 'black.ninia:jep:4.0.2'
implementation "org.apache.httpcomponents:httpclient:4.5.3"
implementation "org.apache.lucene:lucene-analyzers-common:4.5.1"
implementation group: 'org.jruby', name: 'jruby-complete', version: '9.2.13.0'

implementation 'org.slf4j:slf4j-api:1.7.30'
implementation 'ch.qos.logback:logback-classic:1.2.3'

implementation "com.rockymadden.stringmetric:stringmetric-core_2.10:0.27.3"

//Parsing XML/JSON
//implementation group: 'org.codehaus.woodstox', name: 'stax2-api', version: '4.0.0'
//implementation group: 'org.codehaus.woodstox', name: 'woodstox-core-asl', version: '4.4.1'
implementation "com.fasterxml.jackson.core:jackson-core:2.10.1"
implementation "com.fasterxml.jackson.core:jackson-databind:2.10.1"
implementation "com.fasterxml.jackson.module:jackson-module-afterburner:2.10.1"
implementation "com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.10.1"
implementation "com.fasterxml.jackson.core:jackson-core:2.14.3"
implementation "com.fasterxml.jackson.core:jackson-databind:2.14.3"
// implementation "com.fasterxml.jackson.module:jackson-module-afterburner:2.14.3"
implementation "com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.14.3"

// compile group: 'org.apache.httpcomponents', name: 'httpmime', version: '4.5.3'
implementation 'org.apache.commons:commons-collections4:4.1'
@@ -127,19 +143,6 @@ dependencies {
}


/*def libraries = ""
if (Os.isFamily(Os.FAMILY_MAC)) {
if (Os.OS_ARCH.equals("aarch64")) {
libraries = "${file("../grobid-home/lib/mac_arm-64").absolutePath}"
} else {
libraries = "${file("../grobid-home/lib/mac-64").absolutePath}"
}
} else if (Os.isFamily(Os.FAMILY_UNIX)) {
libraries = "${file("../grobid-home/lib/lin-64/jep").absolutePath}:" +
"${file("../grobid-home/lib/lin-64").absolutePath}:"
} else {
throw new RuntimeException("Unsupported platform!")
}*/

task mainJar(type: ShadowJar) {
zip64 true