This folder contains preprocessing scripts for three datasets: HotpotQA, MuSiQue, and TriviaQA. Each script extracts a knowledge graph (KG), in the form of triplets, from one dataset. The details of each script and how it is used are given below:
> python hotpot_extraction.py
This script extracts triplets from the HotpotQA dataset. The main steps are as follows:
- Load the dataset file `hotpot_dev_distractor_v1.json`.
- Use the `llama3:8b` model to extract triplets from each context paragraph.
- Save the extracted triplets to the specified output directory.
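
The loading step can be sketched as follows. This is a minimal illustration, not the code in `hotpot_extraction.py`: it assumes the layout of the public HotpotQA distractor file, where each record has an `_id` and a `context` list of `[title, sentences]` pairs. The helper names `load_hotpot` and `iter_hotpot_paragraphs` are hypothetical.

```python
import json
from typing import Iterator, List, Tuple


def load_hotpot(path: str) -> List[dict]:
    """Load the whole HotpotQA file, which is a single JSON array of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_hotpot_paragraphs(examples: List[dict]) -> Iterator[Tuple[str, str, str]]:
    """Yield (question_id, title, paragraph_text) for every context paragraph."""
    for example in examples:
        for title, sentences in example["context"]:
            # Sentences in the released file carry their own leading spaces,
            # so they are concatenated without an extra separator.
            yield example["_id"], title, "".join(sentences)


# Tiny in-memory sample shaped like one record of the distractor dev file.
sample = [
    {
        "_id": "q1",
        "question": "Which river runs through the capital of France?",
        "context": [
            ["Paris", ["Paris is the capital of France.", " It lies on the Seine."]],
        ],
    }
]
paragraphs = list(iter_hotpot_paragraphs(sample))
```

In a real run, `sample` would be replaced by `load_hotpot("hotpot_dev_distractor_v1.json")`, and each yielded paragraph would be passed to the model.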
> python musique_extraction.py
This script extracts triplets from the MuSiQue dataset. The main steps are as follows:
- Load the dataset file `musique_ans_v1.0_dev.jsonl`.
- Use the `llama3:8b` model to extract triplets from each paragraph.
- Save the extracted triplets to the specified output directory.
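
Unlike the HotpotQA file, the MuSiQue file is JSON Lines, so it is read one record per line. The sketch below assumes the field names of the public MuSiQue release (`id`, `paragraphs`, `title`, `paragraph_text`); the helper `iter_musique_paragraphs` is hypothetical and may differ from what `musique_extraction.py` does.

```python
import io
import json
from typing import Iterable, Iterator, Tuple


def iter_musique_paragraphs(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (question_id, title, paragraph_text) from JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        for para in record["paragraphs"]:
            yield record["id"], para["title"], para["paragraph_text"]


# In-memory stand-in for an open file handle on the .jsonl file.
sample_record = {
    "id": "2hop__1",
    "question": "Who founded the company that makes the iPhone?",
    "paragraphs": [
        {"idx": 0, "title": "Apple Inc.", "paragraph_text": "Apple makes the iPhone."},
        {"idx": 1, "title": "Steve Jobs", "paragraph_text": "Steve Jobs co-founded Apple."},
    ],
}
sample_file = io.StringIO(json.dumps(sample_record) + "\n\n")
musique_paragraphs = list(iter_musique_paragraphs(sample_file))
```

With the real data, `sample_file` would be `open("musique_ans_v1.0_dev.jsonl", encoding="utf-8")`; reading line by line avoids holding the whole file in memory.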
> python trivia_extraction.py
This script extracts triplets from the TriviaQA dataset. The main steps are as follows:
- Load the dataset file `trivia.json`.
- Use the `llama3:8b` model to extract triplets from each context paragraph.
- Save the extracted triplets to the specified output directory.
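
The extract-and-save steps shared by all three scripts might look roughly like the sketch below. Everything here is an assumption made for illustration: the prompt wording, the `subject | relation | object` line format, the output filename `triplets.json`, and the helper names are hypothetical and not taken from the scripts. The model call is passed in as a plain function so the example runs without a model server; the `llama3:8b` tag suggests Ollama, but that is also an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Triplet = Tuple[str, str, str]

# Hypothetical prompt; the actual scripts may phrase this differently.
PROMPT = (
    "Extract (subject, relation, object) triplets from the text below, "
    "one per line, formatted as: subject | relation | object\n\nText: {text}"
)


def parse_triplets(raw: str) -> List[Triplet]:
    """Keep only lines that split into exactly three non-empty fields."""
    triplets = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triplets.append((parts[0], parts[1], parts[2]))
    return triplets


def extract_and_save(
    paragraphs: Dict[str, str],
    generate: Callable[[str], str],
    output_dir: str,
) -> Path:
    """Run the model on each paragraph and write all triplets to one JSON file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {
        key: parse_triplets(generate(PROMPT.format(text=text)))
        for key, text in paragraphs.items()
    }
    out_path = out / "triplets.json"
    out_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return out_path


def fake_generate(prompt: str) -> str:
    """Stand-in for the model; a real version might wrap an Ollama client call."""
    return "Paris | capital of | France\nthis line is not a triplet"


saved = extract_and_save(
    {"q1": "Paris is the capital of France."}, fake_generate, tempfile.mkdtemp()
)
loaded = json.loads(saved.read_text(encoding="utf-8"))
```

Injecting `generate` keeps the parsing and saving logic testable on its own, and malformed model output is dropped rather than written to disk.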