This guide explains the steps a project administrator needs to perform when a dataset implementation is added or updated via a pull request.
First, check out the pull request to obtain a local copy using the GitHub CLI:
gh pr checkout PULL-REQUEST
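`gh pr checkout` creates a local branch tracking the pull request, so the checks below run against the contributor's code. For example, with a hypothetical PR number of 1234, the command would be `gh pr checkout 1234`.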
To ensure the highest possible standard and uniformity of the dataset implementations, please check the following dataset requirements:

- The dataset should be implemented in `bigbio/hub/hub_repos/<dataset>` and contain (at least) the three default files `<dataset>.py`, `bigbiohub.py` and `README.md`.
- Check whether all dataset metadata are given in `<dataset>.py` and `README.md`. Refer to BC5CDR for an example of a complete set of information; a sketch of the typical metadata block follows this list.
- The dataset should not import `bigbio` but instead use `.bigbiohub`.
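For orientation, here is a minimal sketch of the metadata block typically found at the top of `<dataset>.py`. The constant names follow the common BigBio dataloader template; all concrete values below are hypothetical placeholders, so check the actual PR and the BC5CDR script for the authoritative set of fields.

```python
# Illustrative metadata block for a hypothetical <dataset>.py.
# Constant names follow the usual BigBio template; all values are placeholders.
from .bigbiohub import Tasks  # note: the local .bigbiohub, not the bigbio package

_LOCAL = False  # True if the raw data must be supplied via --data_dir
_CITATION = """(BibTeX citation of the dataset paper)"""
_DATASETNAME = "example_dataset"  # hypothetical name
_DISPLAYNAME = "Example Dataset"
_DESCRIPTION = """One-paragraph description of the dataset."""
_HOMEPAGE = "https://example.org/dataset"
_LICENSE = "CC_BY_4p0"  # placeholder license identifier
_URLS = {_DATASETNAME: "https://example.org/data.zip"}
_SUPPORTED_TASKS = [Tasks.NAMED_ENTITY_RECOGNITION]
_SOURCE_VERSION = "1.0.0"
_BIGBIO_VERSION = "1.0.0"
```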
Next, check whether the new or updated dataloader satisfies our unit tests. Run the following command from the top level of the biomedical repo (i.e. the same directory that contains the `requirements.txt` file):
python -m tests.test_bigbio_hub <dataset_name> [--data_dir /path/to/local/data] --test_local
Note that you MUST include the `--test_local` flag to specifically test the script from your PR; otherwise, the script will default to downloading a dataloader script from the Hub. Your particular dataset may require some of the other command-line arguments of the test script (e.g. `--data_dir` for dataloaders that read local files).
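For example, assuming the dataloader directory is named `bc5cdr`, a publicly downloadable dataset would be tested with `python -m tests.test_bigbio_hub bc5cdr --test_local`, while a dataset with `_LOCAL = True` would additionally need `--data_dir /path/to/local/data`.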
To view the full usage instructions, use the `--help` flag:
python -m tests.test_bigbio_hub --help
This will explain the types of arguments you may need for testing. Briefly:

- `dataset_name`: Name of the dataset you want to test.
- `data_dir`: The location of the data for datasets where `_LOCAL = True`.
- `config_name`: Name of the configuration you want to test. By default, the script will test all configs, but you can use this to debug a specific configuration, or if your data is prohibitively large.
- `ishub`: Use this when unit testing scripts that are not yet uploaded to the hub (this is True for most cases).
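Beyond the unit tests, it can be useful to load a single configuration manually and inspect a few examples. The sketch below assumes the standard BigBio config naming (`<dataset>_source` and `<dataset>_bigbio_<schema>`) and uses `bc5cdr` as a stand-in; depending on your `datasets` version, you may also need to pass `trust_remote_code=True`.

```python
# Manual spot-check of one configuration, run from the repo root.
# Assumes the usual BigBio config naming; bc5cdr is used as a stand-in.
from datasets import load_dataset

dataset = load_dataset(
    "bigbio/hub/hub_repos/bc5cdr/bc5cdr.py",  # path to the PR's dataloader script
    name="bc5cdr_bigbio_kb",                  # or "bc5cdr_source"
)
print(dataset)              # available splits and example counts
print(dataset["train"][0])  # inspect a single example
```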
If any (severe) errors occur, report these to the PR author.
Finally, the dataset needs to be updated in the BigBio Hugging Face datasets hub.
Note that this step should be performed just before merging the PR, as uploading to the hub might fail due to errors in the README / dataset card that the PR author can still fix.
For this, first create or retrieve an API access token for your Hugging Face account:
https://huggingface.co/settings/tokens
Note that the token must be created with write access.
Run the following command from the top level of the repository to update the dataset in the hub:
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset>
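Note that a token passed inline like this will typically end up in your shell history; if you prefer, you can instead export it once per session (`export HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN>`) and run the upload commands without the prefix.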
If the PR concerns a completely new dataset, add the option `-c` to first create a new dataset repo in the hub:
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset> -c
Moreover, you can test your upload command by first performing a dry run using the option `-d`:
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset> -d
After running the command, visit the hub webpage of the dataset and check whether the dataset card and the data viewer are displayed and the files are updated correctly, e.g. https://huggingface.co/datasets/bigbio/bc5cdr.
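As an additional check, you can verify that the uploaded dataset loads directly from the hub; a minimal sketch, again assuming the standard BigBio config naming:

```python
# Verify that the freshly uploaded dataset is loadable from the hub.
from datasets import load_dataset

dataset = load_dataset("bigbio/bc5cdr", name="bc5cdr_bigbio_kb")
print(dataset)  # splits and example counts should match the PR's dataloader
```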
If all previous checks were completed successfully, merge the PR into the main branch:
gh pr merge PULL-REQUEST