Releases: UCREL/science_parse_py_api
Works without specifying port
In the previous version you always had to pass the port as a separate argument, e.g.:
from science_parse_api.api import parse_pdf
host = 'http://127.0.0.1'
port = '8080'
output_dict = parse_pdf(host, _file, port=port)
If you omitted the port argument and put the port in the host address instead, it raised an error, e.g.:
from science_parse_api.api import parse_pdf
host = 'http://127.0.0.1:8080'
output_dict = parse_pdf(host, _file)
This version fixes that, so the port can be included in the host address and the port argument omitted.
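For code that still has to run against the previous version, one possible workaround is to split the port out of the address before calling parse_pdf. The sketch below uses only the standard library; split_host_and_port is a hypothetical helper and is not part of science_parse_api.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_host_and_port(address: str) -> Tuple[str, Optional[str]]:
    """Split 'http://host:port' into ('http://host', 'port').

    Hypothetical helper for illustration only. The port is returned
    as a string (or None when the address has no port), matching the
    string port used in the examples above.
    """
    parts = urlsplit(address)
    host = f"{parts.scheme}://{parts.hostname}"
    port = str(parts.port) if parts.port is not None else None
    return host, port


host, port = split_host_and_port('http://127.0.0.1:8080')
print(host, port)
```

The two values can then be passed as parse_pdf(host, _file, port=port), which is the call form the previous version accepted.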
Version 1.0.0
This is the first working release of the Science Parse Python API.
Functions
It has only one function, science_parse_api.api.parse_pdf, which takes the following as input:
- server_address -- Address of the Science Parse server e.g. "http://127.0.0.1"
- file_path_to_pdf -- The file path to the PDF you would like to parse.
- port -- Port of the Science Parse server e.g. "8080"
From this it will return the parsed PDF as a Python dictionary with the following keys:
['abstractText', 'authors', 'id', 'references', 'sections', 'title', 'year']
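As a rough illustration of working with those keys, the sketch below pulls a few fields out of a dictionary shaped like the one parse_pdf returns. The sample_output literal is a hand-written stand-in copied from the example output later in these notes (with sections left empty for brevity), and summarise_paper is a hypothetical helper, not part of science_parse_api. It assumes authors is a list of dictionaries with a name key and references is a list of dictionaries with a title key, as in that example.

```python
from typing import Any, Dict

# Stand-in for the dictionary returned by parse_pdf; values copied from
# the example output in these notes, with 'sections' emptied for brevity.
sample_output: Dict[str, Any] = {
    'abstractText': 'The abstract which is normally short.',
    'authors': [{'affiliations': [], 'name': 'Andrew Moore'}],
    'id': 'SP:045daa3afe8335ca973de6dbed366626376434da',
    'references': [{
        'authors': ['Tomas Mikolov', 'Greg Corrado', 'Kai Chen',
                    'Jeffrey Dean.'],
        'title': ('Efficient estimation of word representations '
                  'in vector space'),
        'venue': ('Proceedings of the International Conference on '
                  'Learning Representations, pages 1–12.'),
        'year': 2013,
    }],
    'sections': [],
    'title': 'Example paper for testing',
    'year': 2021,
}


def summarise_paper(output: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the title, year, author names and reference titles.

    Hypothetical helper: uses .get() throughout so that a missing key
    yields None or an empty list rather than a KeyError.
    """
    return {
        'title': output.get('title'),
        'year': output.get('year'),
        'author_names': [author.get('name')
                         for author in output.get('authors', [])],
        'reference_titles': [reference.get('title')
                             for reference in output.get('references', [])],
    }


summary = summarise_paper(sample_output)
print(summary['title'], summary['year'])
```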
Example
The example below shows how to use the parse_pdf
function and the expected output format. In this example we ran the Science Parse server using Docker, e.g.:
docker run -p 127.0.0.1:8080:8080 --rm ucrel/ucrel-science-parse:3.0.1
from pathlib import Path
import tempfile
from IPython.display import Image
import requests
from science_parse_api.test_helper import test_data_dir
try:
    # Try to find the folder `test_data`
    test_data_directory = test_data_dir()
    test_pdf_paper = Path(test_data_directory,
                          'example_for_test.pdf').resolve()
    image_file_name = str(Path(test_data_directory,
                               'example_test_pdf_as_png.png'))
except FileNotFoundError:
    # If the folder cannot be found, fetch the PDF and image
    # from GitHub instead. This happens when running on
    # Google Colab, for example.
    pdf_url = ('https://github.com/UCREL/science_parse_py_api/'
               'raw/master/test_data/example_for_test.pdf')
    temp_test_pdf_paper = tempfile.NamedTemporaryFile('rb+')
    test_pdf_paper = Path(temp_test_pdf_paper.name)
    with test_pdf_paper.open('rb+') as test_fp:
        test_fp.write(requests.get(pdf_url).content)
    image_url = ('https://github.com/UCREL/science_parse_py_api'
                 '/raw/master/test_data/example_test_pdf_as_png.png')
    image_file = tempfile.NamedTemporaryFile('rb+', suffix='.png')
    with Path(image_file.name).open('rb+') as image_fp:
        image_fp.write(requests.get(image_url).content)
    image_file_name = image_file.name
Image(filename=image_file_name)
import pprint
from science_parse_api.api import parse_pdf
host = 'http://127.0.0.1'
port = '8080'
output_dict = parse_pdf(host, test_pdf_paper, port=port)
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(output_dict)
{   'abstractText': 'The abstract which is normally short.',
    'authors': [{'affiliations': [], 'name': 'Andrew Moore'}],
    'id': 'SP:045daa3afe8335ca973de6dbed366626376434da',
    'references': [   {   'authors': [   'Tomas Mikolov',
                                         'Greg Corrado',
                                         'Kai Chen',
                                         'Jeffrey Dean.'],
                          'title': 'Efficient estimation of word '
                                   'representations in vector space',
                          'venue': 'Proceedings of the International '
                                   'Conference on Learning Representations, '
                                   'pages 1–12.',
                          'year': 2013}],
    'sections': [   {   'text': 'The abstract which is normally short.\n'
                                '1 Introduction\n'
                                'Some introduction text.\n'
                                '2 Section 1\n'
                                'Here is some example text.'},
                    {   'heading': '2.1 Sub Section 1',
                        'text': 'Some more text but with a reference (Mikolov '
                                'et al., 2013).\n'
                                '3 Section 2\n'
                                'The last section\n'
                                'References\n'
                                'Tomas Mikolov, Greg Corrado, Kai Chen, and '
                                'Jeffrey Dean. 2013. Efficient estimation of '
                                'word representations in vector space. '
                                'Proceedings of the International Conference '
                                'on Learning Representations, pages 1–12.'}],
    'title': 'Example paper for testing',
    'year': 2021}
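The sections list in this output is uneven: the first entry has only a text key, while the second has both heading and text. A minimal sketch of flattening such a list into plain text follows. sections_to_text is a hypothetical helper, not part of science_parse_api, and sample_sections is abbreviated from the output above.

```python
from typing import Any, Dict, List


def sections_to_text(sections: List[Dict[str, Any]]) -> str:
    """Join sections into one string, one blank line between sections.

    Hypothetical helper: a section's heading is placed on its own line
    above the text when present, and skipped when the key is missing.
    """
    blocks = []
    for section in sections:
        heading = section.get('heading')
        text = section.get('text', '')
        blocks.append(f"{heading}\n{text}" if heading else text)
    return '\n\n'.join(blocks)


# Abbreviated from the example output: the first section has no heading.
sample_sections = [
    {'text': 'The abstract which is normally short.\n'
             '1 Introduction\n'
             'Some introduction text.'},
    {'heading': '2.1 Sub Section 1',
     'text': 'Some more text but with a reference (Mikolov et al., 2013).'},
]

full_text = sections_to_text(sample_sections)
print(full_text)
```

Using .get() for heading avoids a KeyError on entries like the first one, which a direct section['heading'] lookup would raise.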