# Processing scripts for the jafarisbarov/azwiki dataset

You can find the final dataset here.

## Working with the scripts

Set up the Python environment:

```bash
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

Download the latest multistream version from the Wikimedia dumps:

```bash
# Download the dump
wget https://dumps.wikimedia.org/azwiki/latest/azwiki-latest-pages-articles-multistream.xml.bz2

# Extract the XML file
bzip2 -dk azwiki-latest-pages-articles-multistream.xml.bz2
```

We used the latest dump as of 15.12.2023.

Run the `dewiki.py` script to extract articles from the XML file; they will be stored in the `raw` folder. After this, run the `process.py` script to clean the contents and filter out some files. The results will be saved in the `processed` folder.

```bash
# Extract the articles from the XML file
python scripts/dewiki.py

# Process
python scripts/process.py
```
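For a rough idea of the kind of cleaning and filtering `process.py` performs, a minimal sketch is shown below. The file layout (one article per `.txt` file in `raw/`), the `MIN_LENGTH` threshold, and the whitespace normalization are assumptions for illustration, not the script's actual rules.

```python
# Minimal sketch of a cleaning/filtering pass.
# Assumption: raw/ holds one plain-text article per .txt file; processed/ receives the cleaned copies.
import re
from pathlib import Path

MIN_LENGTH = 200  # assumed threshold: drop very short articles

raw_dir = Path("raw")
out_dir = Path("processed")
out_dir.mkdir(exist_ok=True)

for path in raw_dir.glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace into single spaces
    if len(text) >= MIN_LENGTH:
        (out_dir / path.name).write_text(text, encoding="utf-8")
```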

## Working with Huggingface Datasets

Create a `config.py` file at the root of the project folder and define the following variables:

```python
hf_repo = "username/dataset_name"
hf_token = "your_hf_access_token"
```

After this, run the `hf.py` script to push the dataset to HuggingFace.

```bash
python scripts/hf.py
```
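For reference, a minimal sketch of what a push script like `hf.py` might do is shown below. It assumes the processed articles are plain-text files under `processed/` and uses the `datasets` library; the file layout is an assumption, not necessarily what the script does.

```python
# Sketch: load the processed text files and push them to the Hub.
# Assumption: processed/ holds plain-text articles; config.py defines hf_repo and hf_token.
from datasets import load_dataset
from config import hf_repo, hf_token

dataset = load_dataset("text", data_dir="processed")
dataset.push_to_hub(hf_repo, token=hf_token)
```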

## Acknowledgements and contribution

The original version of the `process.py` script was developed during the AzCorpus project. The `dewiki.py` script has been adapted from this repository.

If there is anything to update regarding the dataset, open a PR here on GitHub; I will update the HuggingFace repo myself.