A tool to obtain the papers of professors of interest listed in a CSV file, parse the PDFs into text, create embeddings from that text, and query them with GPT.
- SerpAPI
- PyPDF2
- OpenAI GPT-3
- Tiktoken
One needs an account with SerpAPI. SerpAPI is used to query Google Scholar, and it allows up to 100 free queries per month.
Additionally, one needs access to OpenAI GPT APIs.
Create a config.yaml file with the following keys:
csv: <CSV FILE NAME>
serpapi_key: <SerpAPI API_KEY>
openai:
  api_key: <OpenAI API_KEY>
  organization: <Org name registered with OpenAI>
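A minimal sketch of how such a file might be loaded and checked before any API call is made. It assumes PyYAML is installed and that api_key and organization are nested under openai; `validate_config` and `load_config` are hypothetical helper names, not functions taken from this repository.

```python
from typing import Any, Dict, List

# Keys described above; the nesting under "openai" is an assumption.
REQUIRED_TOP_LEVEL = ("csv", "serpapi_key", "openai")
REQUIRED_OPENAI = ("api_key", "organization")


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return the names of missing keys (empty list if the config is complete)."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in cfg]
    openai_cfg = cfg.get("openai")
    if isinstance(openai_cfg, dict):
        missing += [f"openai.{key}" for key in REQUIRED_OPENAI if key not in openai_cfg]
    elif "openai" in cfg:
        # The key exists but is not a mapping, e.g. the nested lines lost their indentation.
        missing.append("openai (must be a mapping)")
    return missing


def load_config(path: str = "config.yaml") -> Dict[str, Any]:
    """Read the YAML file and fail early if a required key is absent."""
    import yaml  # PyYAML, imported lazily so validate_config works without it

    with open(path, encoding="utf-8") as fh:
        cfg = yaml.safe_load(fh) or {}
    missing = validate_config(cfg)
    if missing:
        raise KeyError(f"{path} is missing: {', '.join(missing)}")
    return cfg
```

Checking the keys up front gives a clear error message instead of a failure halfway through a fetch.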
Create the environment:
conda env create -f environment.yml
To fetch all papers published from 2022 onwards by the professors of interest:
python fetch.py
This should create a folder papers, which contains the PDFs.
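How fetch.py builds its requests is not documented here, so the following is only a sketch of what a Google Scholar query through SerpAPI might look like. It uses SerpAPI's `google_scholar` engine with the `as_ylo` parameter as the lower year bound; `build_scholar_query` is a hypothetical helper, and the `author:` search operator is an assumption about how the professors would be looked up.

```python
from urllib.parse import urlencode

SERPAPI_ENDPOINT = "https://serpapi.com/search.json"


def build_scholar_query(author: str, api_key: str, since_year: int = 2022) -> str:
    """Build a SerpAPI URL for one author's papers from since_year onwards."""
    params = {
        "engine": "google_scholar",   # SerpAPI's Google Scholar engine
        "q": f'author:"{author}"',    # assumed query form, not taken from fetch.py
        "as_ylo": since_year,         # only include results from this year or later
        "api_key": api_key,
    }
    return f"{SERPAPI_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Prints the URL only; no request is sent, so no free query is consumed.
    print(build_scholar_query("Jane Doe", "<SerpAPI API_KEY>"))
```

Each request sent to such a URL counts against the monthly query allowance, so it is worth building and inspecting the URLs before sending them.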
Then, to extract data from the PDFs, run:
python extract.py
This should create a folder papers_parse, which contains the parsed data from each PDF.
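The extraction step could be sketched roughly as follows, assuming PyPDF2 version 2.0 or later (which provides `PdfReader`). The helper names and the one-.txt-file-per-PDF output layout are assumptions for illustration; the actual format written by extract.py is not specified here.

```python
from pathlib import Path


def output_path(pdf_path: Path, out_dir: Path) -> Path:
    """Map papers/foo.pdf to papers_parse/foo.txt (assumed naming scheme)."""
    return out_dir / (pdf_path.stem + ".txt")


def extract_text(pdf_path: Path) -> str:
    """Concatenate the text of every page in the PDF."""
    from PyPDF2 import PdfReader  # imported lazily; requires PyPDF2 >= 2.0

    reader = PdfReader(str(pdf_path))
    # extract_text() may yield nothing for scanned or image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_all(in_dir: str = "papers", out_dir: str = "papers_parse") -> None:
    """Parse every PDF in in_dir and write its text into out_dir."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        output_path(pdf, out).write_text(extract_text(pdf), encoding="utf-8")
```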
Finally, to ask GPT a question, run:
python gpt.py -question <QUESTION> -new <True/False>
Set the -new flag to True to create new embeddings; otherwise, set it to False.
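One way the -new flag could drive embedding reuse is sketched below. This is not the code in gpt.py: `str2bool`, `parse_args`, and `load_or_create_embeddings` are hypothetical names, the JSON cache file is an assumption, and `embed` stands in for whatever function calls the OpenAI embeddings API.

```python
import argparse
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional


def str2bool(value: str) -> bool:
    """Parse True/False explicitly; plain bool("False") would be truthy."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Ask GPT a question about the papers")
    parser.add_argument("-question", required=True, help="question to ask")
    parser.add_argument("-new", type=str2bool, default=False,
                        help="True to recompute embeddings, False to reuse them")
    return parser.parse_args(argv)


def load_or_create_embeddings(
    texts: Dict[str, str],
    embed: Callable[[str], List[float]],
    cache: Path,
    new: bool,
) -> Dict[str, List[float]]:
    """Reuse cached embeddings unless new is True or no cache exists yet."""
    if not new and cache.exists():
        return json.loads(cache.read_text(encoding="utf-8"))
    embeddings = {name: embed(text) for name, text in texts.items()}
    cache.write_text(json.dumps(embeddings), encoding="utf-8")
    return embeddings
```

Reusing the cache avoids paying for the same embedding calls again each time a new question is asked about an unchanged set of papers.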
- Use Grobid instead of PyPDF2 for better PDF parsing
- Fine-tune the GPT model