PDF parsing toolkit for preparing text corpus

Introduction

This repo contains a PDF parsing toolkit that converts PDFs to Markdown for text corpus preparation. It builds on PDF Parser ToolKits, which gathers the most-used PDF OCR tools for academic papers, and is inspired by grobid_tei_xml, an open-source PyPI package. We developed sciparser 1.0 for text corpus pre-processing. In recent works such as K2 and GeoGalactica, we used this tool with an upgraded Grobid backend to pre-process the text corpus. An online demo is also publicly available.

Try DEMO

This repo and the demo share only the secondary processing solution built on Grobid. We plan to share the multiple-backend combination solution for PDF parsing in the near future.

Requirements

git clone https://github.com/Acemap/pdf_parser.git
cd pdf_parser
pip install -r requirements.txt
python setup.py install

git clone https://github.com/davendw49/sciparser.git
cd sciparser
pip install -r requirements.txt

Usage

python

First, clone the whole repo.

git clone https://github.com/davendw49/sciparser.git

Then import the pipeline function and run the parsing.

from pipeline import pipeline
data = pipeline('/path/to/your/pdf/')
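
To keep the parsed result for later corpus building, the call above can be wrapped in a small helper. This is only a sketch: `parse_and_save` and its parameters are hypothetical names, not part of sciparser, and it assumes that `pipeline` takes the path to a single PDF file and returns something `json` can serialize. Neither assumption is confirmed by this README.

```python
import json
from pathlib import Path
from typing import Any, Callable


def parse_and_save(pdf_path: str, out_dir: str, parse_fn: Callable[[str], Any]) -> Path:
    """Run parse_fn on one PDF and store its result as <pdf name>.json in out_dir.

    parse_fn is passed in rather than imported here, so the helper does not
    depend on how sciparser's pipeline is actually implemented (assumption).
    """
    src = Path(pdf_path)
    if not src.is_file():
        raise FileNotFoundError(f"PDF not found: {src}")

    result = parse_fn(str(src))

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / (src.stem + ".json")
    # default=str is a fallback for values that json cannot serialize directly.
    target.write_text(
        json.dumps(result, ensure_ascii=False, indent=2, default=str),
        encoding="utf-8",
    )
    return target
```

With the repo cloned, this would be called as `parse_and_save('/path/to/your.pdf', 'parsed', pipeline)` after `from pipeline import pipeline`. If `pipeline` expects a directory instead of a file, the `is_file()` check needs to be adjusted.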

gradio

python main.py

Citation

@misc{sciparser,
  author = {Cheng Deng},
  title = {Sciparser: PDF parsing toolkit for preparing text corpus},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/davendw49/sciparser}},
}

Reference

PDF Parser ToolKits: https://github.com/Acemap/pdf_parser
TEI-XML Parser (grobid_tei_xml): https://gitlab.com/internetarchive/grobid_tei_xml
