Extracting PDFs of Authors

🧐 About

A tool that fetches the papers of the professors of interest listed in a CSV file, parses the PDFs into text, builds embeddings from that text, and answers questions about it with GPT.

Built using

  • SerpAPI
  • PyPDF2
  • OpenAI GPT-3
  • Tiktoken

🏁 Getting Started

Prerequisites

  • A SerpAPI account. SerpAPI is used to query Google Scholar and allows up to 100 free queries per month.

  • Access to the OpenAI GPT APIs.

Create a config.yaml file with the following keys:

csv: <CSV FILE NAME>
serpapi_key: <SerpAPI API_KEY>
openai:
  api_key: <OpenAI API_KEY>
  organization: <Org name registered with OpenAI>
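A missing or misspelled key in config.yaml tends to surface only as a late failure, so it can help to check the file up front. The following is a minimal sketch, not the project's own loader: the function names `missing_keys` and `load_config` are hypothetical, and it assumes PyYAML is available in the environment.

```python
# Keys that the config.yaml described above is expected to contain.
REQUIRED_TOP_LEVEL = ("csv", "serpapi_key")
REQUIRED_OPENAI = ("api_key", "organization")


def missing_keys(config):
    """Return the dotted names of required keys that are absent or empty."""
    missing = [key for key in REQUIRED_TOP_LEVEL if not config.get(key)]
    openai_section = config.get("openai")
    if not isinstance(openai_section, dict):
        # The whole "openai" mapping is absent (or is not a mapping).
        missing.append("openai")
    else:
        missing += [
            f"openai.{key}" for key in REQUIRED_OPENAI if not openai_section.get(key)
        ]
    return missing


def load_config(path="config.yaml"):
    """Read config.yaml and fail early if a required key is missing."""
    import yaml  # PyYAML (assumed installed); imported lazily on purpose

    with open(path, encoding="utf-8") as handle:
        config = yaml.safe_load(handle) or {}
    problems = missing_keys(config)
    if problems:
        raise KeyError(f"{path} is missing: {', '.join(problems)}")
    return config
```

Calling `load_config()` at the top of each script would report every missing key at once instead of one at a time.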

Installing

Create the environment

conda env create -f environment.yml

🎈 Usage

To fetch all papers published from 2022 onwards by the professors of interest, run

python fetch.py

This should create a folder named papers that contains the PDFs.
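The "2022 onwards" cutoff can be pictured as a simple filter over the paper records returned by the search. The sketch below is an illustration only: it assumes each record is a dict with a `"year"` field that may be a string, an integer, or missing, and the name `published_since` is hypothetical rather than taken from fetch.py.

```python
def published_since(papers, min_year=2022):
    """Keep the papers whose "year" field parses to an integer >= min_year."""
    recent = []
    for paper in papers:
        try:
            year = int(paper.get("year", ""))
        except (TypeError, ValueError):
            # Skip records with a missing or malformed year.
            continue
        if year >= min_year:
            recent.append(paper)
    return recent


# Hypothetical sample records, for illustration only.
sample = [
    {"title": "Paper A", "year": "2021"},
    {"title": "Paper B", "year": "2022"},
    {"title": "Paper C", "year": 2023},
    {"title": "Paper D"},
]
recent_titles = [paper["title"] for paper in published_since(sample)]
```

Records without a usable year are dropped rather than guessed at, which keeps the cutoff strict.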

Then, to extract the text from the PDFs, run

python extract.py

This should create a folder named papers_parse that contains the parsed data from each PDF.
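For readers curious what this step involves, here is a rough sketch of PDF-to-text extraction with PyPDF2. It is not the contents of extract.py: the helper names are hypothetical, the one-`.txt`-file-per-PDF layout is an assumption, and `PdfReader` requires PyPDF2 2.0 or later (older releases used `PdfFileReader`).

```python
from pathlib import Path


def parsed_path(pdf_path, out_dir="papers_parse"):
    """Map papers/foo.pdf to papers_parse/foo.txt (assumed output layout)."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def extract_text(pdf_path):
    """Concatenate the extracted text of every page in one PDF."""
    from PyPDF2 import PdfReader  # imported lazily; needs PyPDF2 >= 2.0

    reader = PdfReader(str(pdf_path))
    # extract_text() may yield nothing for image-only pages, hence the "or".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_folder(in_dir="papers", out_dir="papers_parse"):
    """Write one text file per PDF found in in_dir."""
    Path(out_dir).mkdir(exist_ok=True)
    for pdf_path in sorted(Path(in_dir).glob("*.pdf")):
        target = parsed_path(pdf_path, out_dir)
        target.write_text(extract_text(pdf_path), encoding="utf-8")
```

Scanned papers without a text layer would come out empty with this approach, which is one reason a dedicated parser can give better results.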

Finally, to ask GPT a question, run

python gpt.py -question <QUESTION> -new <True/False>

Set the -new flag to True to create new embeddings; otherwise set it to False.
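A flag that takes the literal text True or False needs explicit parsing, because `bool("False")` is `True` in Python. The sketch below shows one way to handle the two flags with argparse. It is an assumption about how such a command line could be parsed, not the actual code in gpt.py, and `str_to_bool` and `build_parser` are hypothetical names.

```python
import argparse


def str_to_bool(value):
    """Convert the text True/False (case-insensitive) to a real boolean."""
    text = value.strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    """Build a parser for the -question and -new flags."""
    parser = argparse.ArgumentParser(
        description="Ask GPT a question about the parsed papers."
    )
    parser.add_argument("-question", required=True, help="question to ask")
    parser.add_argument(
        "-new",
        type=str_to_bool,
        default=False,
        help="True to create new embeddings, False to reuse existing ones",
    )
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["-question", "What is new?", "-new", "False"])
```

With this parsing in place, `args.new` can be used directly in an `if` statement to decide whether to rebuild the embeddings.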

TODO

  • Use Grobid instead of PyPDF2 for better PDF parsing
  • Fine-tune the GPT model

✍️ Authors