Releases: UCREL/science_parse_py_api

Works without specifying port

27 Mar 08:46

In the previous version you always had to pass the port as a separate argument, e.g.:

from science_parse_api.api import parse_pdf

host = 'http://127.0.0.1'
port = '8080'
output_dict = parse_pdf(host, _file, port=port)

If you instead included the port in the host address and left out the port argument, it raised an error, e.g.:

from science_parse_api.api import parse_pdf

host = 'http://127.0.0.1:8080'
output_dict = parse_pdf(host, _file)

This has been corrected in this version.
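
To illustrate the two call styles, the sketch below reduces both to the same server address. It is a minimal sketch and not the library's implementation: the helper name `build_server_url` is hypothetical, and it only shows how an optional port might be appended when the address does not already carry one.

```python
from typing import Optional
from urllib.parse import urlsplit


def build_server_url(server_address: str, port: Optional[str] = None) -> str:
    """Hypothetical helper: append `port` only when one is given."""
    address = server_address.rstrip('/')
    if port is None:
        # Assume the caller already put any port in the address itself.
        return address
    if urlsplit(address).port is not None:
        raise ValueError('port given twice: in the address and as an argument')
    return f'{address}:{port}'


# Both call styles resolve to the same address.
print(build_server_url('http://127.0.0.1', port='8080'))
print(build_server_url('http://127.0.0.1:8080'))
```

Rejecting a port that is supplied twice is a design choice of this sketch; silently preferring one of the two values would be an equally valid convention.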

Version 1.0.0

10 Mar 09:38

This is the first working release of the Science Parse Python API.

Functions

It has a single function, science_parse_api.api.parse_pdf, which takes the following as input:

  1. server_address -- Address of the Science Parse server e.g. "http://127.0.0.1"
  2. file_path_to_pdf -- The file path to the PDF you would like to parse.
  3. port -- Port of the Science Parse server e.g. "8080"

It returns the parsed PDF as a Python dictionary with the following keys:

['abstractText', 'authors', 'id', 'references', 'sections', 'title', 'year']
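
As a minimal sketch of working with the returned dictionary, the check below reports which of the keys listed above are absent from a result. The helper name `missing_keys` and the sample dictionary are illustrative assumptions, not part of the package; in practice the dictionary would come from parse_pdf.

```python
EXPECTED_KEYS = ['abstractText', 'authors', 'id', 'references',
                 'sections', 'title', 'year']


def missing_keys(output_dict: dict) -> list:
    """Hypothetical helper: return the expected keys absent from a result."""
    return [key for key in EXPECTED_KEYS if key not in output_dict]


# Stand-in for a parse_pdf result, so the sketch runs without a server.
sample_output = {'abstractText': '', 'authors': [], 'id': 'SP:example',
                 'references': [], 'sections': [], 'title': 'Example',
                 'year': 2021}
print(missing_keys(sample_output))
print(missing_keys({'title': 'Example'}))
```

A check like this may be useful before indexing into the dictionary, in case a response does not contain every key.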

Example

The example below shows how to use the parse_pdf function and the expected output format. In this example we ran the Science Parse server using Docker:

docker run -p 127.0.0.1:8080:8080 --rm ucrel/ucrel-science-parse:3.0.1

from pathlib import Path
import tempfile

from IPython.display import Image
import requests

from science_parse_api.test_helper import test_data_dir

try:
    # Tries to find the folder `test_data`
    test_data_directory = test_data_dir()
    test_pdf_paper = Path(test_data_directory, 
                      'example_for_test.pdf').resolve()
    image_file_name = str(Path(test_data_directory, 
                               'example_test_pdf_as_png.png'))
except FileNotFoundError:
    # If the folder cannot be found, download the PDF and
    # image from GitHub instead. This happens when running
    # on Google Colab.
    pdf_url = ('https://github.com/UCREL/science_parse_py_api/'
               'raw/master/test_data/example_for_test.pdf')
    temp_test_pdf_paper = tempfile.NamedTemporaryFile('rb+')
    test_pdf_paper = Path(temp_test_pdf_paper.name)
    with test_pdf_paper.open('rb+') as test_fp:
        test_fp.write(requests.get(pdf_url).content)
        
    image_url = ('https://github.com/UCREL/science_parse_py_api'
                 '/raw/master/test_data/example_test_pdf_as_png.png')
    image_file = tempfile.NamedTemporaryFile('rb+', suffix='.png')
    with Path(image_file.name).open('rb+') as image_fp:
        image_fp.write(requests.get(image_url).content)
    image_file_name = image_file.name
    

Image(filename=image_file_name)

import pprint
from science_parse_api.api import parse_pdf

host = 'http://127.0.0.1'
port = '8080'
output_dict = parse_pdf(host, test_pdf_paper, port=port)

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(output_dict)

{   'abstractText': 'The abstract which is normally short.',
    'authors': [{'affiliations': [], 'name': 'Andrew Moore'}],
    'id': 'SP:045daa3afe8335ca973de6dbed366626376434da',
    'references': [   {   'authors': [   'Tomas Mikolov',
                                         'Greg Corrado',
                                         'Kai Chen',
                                         'Jeffrey Dean.'],
                          'title': 'Efficient estimation of word '
                                   'representations in vector space',
                          'venue': 'Proceedings of the International '
                                   'Conference on Learning Representations, '
                                   'pages 1–12.',
                          'year': 2013}],
    'sections': [   {   'text': 'The abstract which is normally short.\n'
                                '1 Introduction\n'
                                'Some introduction text.\n'
                                '2 Section 1\n'
                                'Here is some example text.'},
                    {   'heading': '2.1 Sub Section 1',
                        'text': 'Some more text but with a reference (Mikolov '
                                'et al., 2013).\n'
                                '3 Section 2\n'
                                'The last section\n'
                                'References\n'
                                'Tomas Mikolov, Greg Corrado, Kai Chen, and '
                                'Jeffrey Dean. 2013. Efficient estimation of '
                                'word representations in vector space. '
                                'Proceedings of the International Conference '
                                'on Learning Representations, pages 1–12.'}],
    'title': 'Example paper for testing',
    'year': 2021}
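
The output above can be navigated with ordinary dictionary and list access. The sketch below uses a trimmed copy of that output rather than a live server, and collects author names, section headings and reference titles. The first section in the output has no 'heading' key, so the sketch reads it with dict.get. The helper name `summarise` is hypothetical and not part of the package.

```python
def summarise(output_dict: dict) -> dict:
    """Hypothetical helper: pull a few fields out of a parse_pdf result."""
    return {
        'title': output_dict.get('title'),
        'authors': [a['name'] for a in output_dict.get('authors', [])],
        # Sections without a heading are reported as None.
        'headings': [s.get('heading')
                     for s in output_dict.get('sections', [])],
        'reference_titles': [r.get('title')
                             for r in output_dict.get('references', [])],
    }


# Trimmed copy of the example output shown above (section text shortened).
example_output = {
    'title': 'Example paper for testing',
    'authors': [{'affiliations': [], 'name': 'Andrew Moore'}],
    'sections': [{'text': 'The abstract which is normally short.'},
                 {'heading': '2.1 Sub Section 1', 'text': 'Some more text.'}],
    'references': [{'title': 'Efficient estimation of word '
                             'representations in vector space',
                    'year': 2013}],
}
summary = summarise(example_output)
print(summary['authors'])
print(summary['headings'])
```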