
PDF parsing toolkit for preparing text corpus

Introduction

This repo contains a PDF parsing toolkit that converts PDF to Markdown for preparing text corpora. It is based on PDF Parser ToolKits, which gathers the most-used PDF OCR tools for academic papers, and is inspired by grobid_tei_xml, an open-source PyPI package. We developed sciparser 1.0 for text corpus pre-processing. In recent works such as K2 and GeoGalactica, we used this tool with an upgraded Grobid backend to pre-process the text corpus. An online demo is also publicly available.

In this repo and demo, we share only the secondary processing solution built on Grobid. We plan to share the multiple-backend combination solution for PDF parsing in the near future.

Requirements

git clone https://github.com/Acemap/pdf_parser.git
cd pdf_parser
pip install -r requirements.txt
python setup.py install

git clone https://github.com/davendw49/sciparser.git
cd sciparser
pip install -r requirements.txt

Usage

  • python

First, clone the whole repo.

git clone https://github.com/davendw49/sciparser.git

Then import pipeline from the pipeline file to run the parsing.

from pipeline import pipeline
data = pipeline('/path/to/your/pdf/')
  • gradio
python main.py
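
For the python route above, a batch run over a folder of PDFs might be sketched as follows. This is only an illustration under stated assumptions: the README does not say whether pipeline expects a single file or a directory, nor what it returns, so the sketch assumes one PDF path per call and treats the result as opaque. The helper name parse_folder and the output layout (one JSON file per PDF) are hypothetical and not part of sciparser.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


def parse_folder(
    pdf_dir: str, parse_fn: Callable[[str], Any], out_dir: str
) -> Dict[str, str]:
    """Run parse_fn on every *.pdf in pdf_dir and save each result as JSON.

    parse_fn stands in for sciparser's pipeline (assumed: one path in,
    one parsed result out). Returns a map of file name -> status string.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    status: Dict[str, str] = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            result = parse_fn(str(pdf))
            # default=str keeps the dump from failing on non-JSON types,
            # because the structure returned by pipeline is not documented.
            text = json.dumps(result, ensure_ascii=False, indent=2, default=str)
            (out / f"{pdf.stem}.json").write_text(text, encoding="utf-8")
            status[pdf.name] = "ok"
        except Exception as exc:  # one bad PDF should not stop the batch
            status[pdf.name] = f"failed: {exc}"
    return status


# Usage from inside the cloned repo (not executed here):
#   from pipeline import pipeline
#   print(parse_folder("/path/to/your/pdf/", pipeline, "parsed"))
```

Passing the parser in as an argument keeps the helper independent of how pipeline is imported, and lets it be tried with a stub function before a Grobid backend is available.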

Citation

@misc{sciparser,
  author = {Cheng Deng},
  title = {Sciparser: PDF parsing toolkit for preparing text corpus},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/davendw49/sciparser}},
}

Reference