Data Preprocessing

This folder contains scripts for preprocessing three datasets: HotpotQA, MuSiQue, and TriviaQA. Each script extracts knowledge graph (KG) triplets from one dataset. The details of each script and how to run it are given below:

HotpotQA Dataset

> python hotpot_extraction.py

This script extracts triplets from the HotpotQA dataset. The main steps are as follows:

Load the dataset file hotpot_dev_distractor_v1.json.
Use the llama3:8b model to extract triplets from each context paragraph.
Save the extracted triplets to the specified output directory.
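
The loading step can be sketched as follows. This is a minimal illustration, not the code in hotpot_extraction.py: it assumes the published HotpotQA layout, where the file is a single JSON array and each example stores its context as [title, [sentences]] pairs. The function names load_hotpot and iter_hotpot_paragraphs are hypothetical.

```python
import json
from typing import Iterator, Tuple


def load_hotpot(path: str) -> list:
    """Load the distractor file, assumed to be one JSON array of examples."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_hotpot_paragraphs(examples: list) -> Iterator[Tuple[str, str, str]]:
    """Yield (example_id, title, paragraph_text) for every context paragraph."""
    for example in examples:
        # Each context entry is assumed to be a [title, [sentence, ...]] pair.
        for title, sentences in example["context"]:
            # Sentences usually keep their own leading spaces, so they are
            # concatenated without an extra separator.
            yield example["_id"], title, "".join(sentences).strip()
```

Each yielded paragraph is the unit that would be passed to the model for triplet extraction.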

MuSiQue Dataset

> python musique_extraction.py

This script extracts triplets from the MuSiQue dataset. The main steps are as follows:

Load the dataset file musique_ans_v1.0_dev.jsonl.
Use the llama3:8b model to extract triplets from each paragraph.
Save the extracted triplets to the specified output directory.
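
Because the MuSiQue file is JSON Lines rather than a single JSON document, it is read one record per line. The sketch below is illustrative only and is not taken from musique_extraction.py: it assumes each record has an "id" and a "paragraphs" list whose items carry "title" and "paragraph_text" fields, as in the published MuSiQue release. The function name iter_musique_paragraphs is hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_musique_paragraphs(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (example_id, title, paragraph_text) from JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines, e.g. a trailing newline
            continue
        example = json.loads(line)
        for paragraph in example["paragraphs"]:
            yield example["id"], paragraph["title"], paragraph["paragraph_text"]
```

An open file handle can be passed directly as lines, since iterating over a text file yields one line at a time.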

TriviaQA Dataset

> python trivia_extraction.py

This script extracts triplets from the TriviaQA dataset. The main steps are as follows:

Load the dataset file trivia.json.
Use the llama3:8b model to extract triplets from each context paragraph.
Save the extracted triplets to the specified output directory.
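
The extract and save steps shared by the three scripts can be sketched as below. Everything here is an assumption made for illustration: the prompt wording, the "(subject, relation, object)" reply format, the output file name triplets.json, and the function names are hypothetical. The generate callable stands in for however llama3:8b is served; the tag resembles Ollama's naming, but this README does not say which runtime is used.

```python
import json
import re
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

# Hypothetical prompt; the scripts' actual wording is not shown in this README.
PROMPT = (
    "Extract knowledge graph triplets from the text below. "
    "Write one triplet per line as (subject, relation, object).\n\nText: {text}"
)

# Matches spans such as "(Paris, capital of, France)".
TRIPLET_RE = re.compile(r"\(([^,()]+),([^,()]+),([^()]+)\)")


def parse_triplets(response: str) -> List[Tuple[str, ...]]:
    """Pull (subject, relation, object) tuples out of a free-text model reply."""
    return [
        tuple(part.strip() for part in match.groups())
        for match in TRIPLET_RE.finditer(response)
    ]


def extract_and_save(
    paragraphs: Iterable[Tuple[str, str, str]],
    generate: Callable[[str], str],
    out_dir: str,
) -> Path:
    """Run the model on each paragraph and write all triplets to one JSON file."""
    results = []
    for example_id, title, text in paragraphs:
        reply = generate(PROMPT.format(text=text))
        results.append(
            {"id": example_id, "title": title, "triplets": parse_triplets(reply)}
        )
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the output directory if needed
    out_file = out_path / "triplets.json"  # hypothetical file name
    out_file.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return out_file
```

Keeping the model call behind a plain callable means the parsing and saving logic can be checked with a stub function before running the full model over a dataset.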
