Everyone is welcome to contribute, and we value contributions from the community. One of the best ways to contribute is by adding a data set to the evaluation benchmark!
- Find a task on the issues page. Self-assign or comment to indicate interest.
- Coordinate with the other contributors when more than one person has indicated interest.
- Open a new branch
- Open a pull request "Add <task_name> dataset" when you are ready. Make sure to include
  - which model(s) the task was evaluated on
  - computation time benchmark on GPU (preferred) and/or CPU
- New tasks will be placed under `evaluation/tasks`
- Make a copy of the directory `evaluation/tasks/template` and rename the directory to match your task, i.e. in the root directory, run
  ```bash
  cp -r evaluation/tasks/template evaluation/tasks/{{SOME_NEW_TASK}}
  ```
- Your new task directory will include 4 files:
  - `__init__.py`
  - `english.json`: JSON file for task-specific configurations of English-only data (e.g. `batch_size`); a loading sketch follows this list
  - [For multilingual tasks only] `multilingual.json`: JSON file for task-specific configuration of multilingual data
  - `task_name.py`: the main module (a skeleton sketch follows the references below)
    - Wrap data as a PyTorch Dataset/DataLoader
    - Rename `TemplateTask` (which inherits `AutoTask`) to match your task
    - Implement all abstract methods for your task
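As a minimal sketch of how a task might read its JSON configuration: the `load_task_config` helper below is hypothetical (it is not part of the repo), and only the `batch_size` field is taken from this guide; check the template for how configs are actually consumed.

```python
import json
from pathlib import Path


def load_task_config(task_dir: str, multilingual: bool = False) -> dict:
    """Hypothetical helper: read the task's JSON config, e.g. {"batch_size": 16}."""
    config_name = "multilingual.json" if multilingual else "english.json"
    with open(Path(task_dir) / config_name, encoding="utf-8") as f:
        return json.load(f)


# Example usage once your task directory exists:
#   config = load_task_config("evaluation/tasks/{{SOME_NEW_TASK}}")
#   batch_size = config["batch_size"]
```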
References:
- Template task
- Fully implemented example for TydiQA Secondary
- Feel free to use Hugging Face's GPT2LMHead as the base model
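To tie the points above together, here is a rough sketch of what a task module might look like. It is not the repo's actual `AutoTask` interface: the `MyTask`/`MyTaskDataset` names, the `evaluate` signature, the default `batch_size`, and the `gpt2` checkpoint are illustrative assumptions; follow the template and the TydiQA Secondary example for the authoritative structure.

```python
# Illustrative sketch only -- see evaluation/tasks/template for the real interface.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast


class MyTaskDataset(Dataset):
    """Wraps raw examples as a PyTorch Dataset so a DataLoader can batch them."""

    def __init__(self, examples):
        self.examples = examples  # e.g. a list of prompt strings

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


class MyTask:  # in the repo this would be TemplateTask renamed, inheriting AutoTask
    def __init__(self, batch_size: int = 16):
        # batch_size would normally come from english.json / multilingual.json
        self.batch_size = batch_size
        self.tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token  # GPT-2 has no pad token
        self.model = GPT2LMHeadModel.from_pretrained("gpt2")
        self.model.eval()

    def evaluate(self, examples):
        """Run the model over the task data and collect per-batch logits (illustrative)."""
        loader = DataLoader(MyTaskDataset(examples), batch_size=self.batch_size)
        outputs = []
        with torch.no_grad():
            for batch in loader:  # batch is a list of strings
                encoded = self.tokenizer(batch, return_tensors="pt", padding=True)
                outputs.append(self.model(**encoded).logits)
        return outputs
```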
- Make modifications and commit any changes. It's best to make your commit messages informative to help your reviewer. Below is a list of meta-labels to get you started:
  ```
  # feat (new feature)
  # fix (bug fix)
  # refactor (refactoring production code)
  # style (formatting, missing semicolons, etc.; no code change)
  # docs (changes to documentation)
  # test (adding or refactoring tests; no production code change)
  # chore (updating grunt tasks etc.; no production code change)
  # build (changes that affect the build system or external dependencies)
  # ci (changes to our CI configuration files and scripts)
  # version (version bump/new release; no production code change)
  # debug (changes in debugging code/frameworks; no production code change)
  # license (edits regarding licensing; no production code change)
  # hack (temporary fix to make things move forward; please avoid)
  ```
  For example, one possible commit message would be `feat: implement lambada evaluation`.
- Write prompts to reformat the dataset as an LM task if necessary (e.g. QA tasks); see the sketch after this list
  - Submit prompts to the promptsource repo
  - Prompts are in jinja2 format
  - Try to have at least 3 prompts
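For example, a QA prompt in jinja2 could look like the snippet below. The field names (`context`, `question`, `answers`) are assumptions about the dataset schema, and the `|||` separator between input and target follows the convention commonly used in promptsource templates; verify both against that repo's guidelines.

```python
from jinja2 import Template

# Hypothetical prompt for a QA dataset; field names are assumptions about the schema.
qa_prompt = Template(
    "Answer the question using the passage below.\n\n"
    "Passage: {{ context }}\n"
    "Question: {{ question }}\n"
    "Answer: ||| {{ answers[0] }}"
)

example = {
    "context": "The Nile flows through northeastern Africa.",
    "question": "Which continent does the Nile flow through?",
    "answers": ["Africa"],
}
print(qa_prompt.render(**example))
```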
- Run `make quality` at the root of the repo to check for linting and code styling issues
- Run `make style` at the root of the repo to auto-format the code
- Update the Overleaf Tech Report with information on the task you added
- Add a new GitHub issue requesting your task be made multilingual
  - Label the issue with “multilingual”
  - Specify in the text of the issue which languages the task already supports
  - The multilinguality group is working on recruiting speakers of all the training languages to adapt English prompts to other languages