feat: use promptsource templates #62
base: main
Conversation
Force-pushed from 4207819 to 5cd09aa
@@ -4,3 +4,4 @@ tensorflow==2.5.0
 torch==1.9.0
 tqdm==4.62.0
 transformers==4.9.1
+promptsource @ git+https://[email protected]/bigscience-workshop/promptsource.git@main
A side note: ssh will fail.
def test_promptsource_template():
    ds_key, sub_key = "tydiqa", "secondary_task"
    tydiqa_sec_vld_ds = load_dataset(ds_key, sub_key, split="validation", streaming=True)
promptsource also has a dataset-loading helper, but I really want to use streaming=True if at all possible (depending on each dataset's compression format).
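The language-filtering pattern used in the test below can be sketched in isolation. The records here are hypothetical stand-ins; the real code pulls them lazily from load_dataset(..., streaming=True), which yields dicts one at a time instead of downloading the full split.

```python
def filter_by_language(examples, language="english"):
    """Yield only examples whose "id" field is prefixed with the given language.

    Mirrors the lambda used in the test; works on any iterable of dicts,
    including a streaming dataset, because it never materializes the input.
    """
    return (ex for ex in examples if ex["id"].split("-")[0] == language)

# Hypothetical records mimicking TyDi QA secondary-task ids ("<language>-<number>").
records = [
    {"id": "english-123", "question": "Q1"},
    {"id": "finnish-456", "question": "Q2"},
    {"id": "english-789", "question": "Q3"},
]

english_only = list(filter_by_language(records))
```

Because the filter is a generator expression, it composes with `next(...)` the same way the streaming dataset does in the test.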
    tydiqa_sec_vld_ds_en = filter(lambda x: x["id"].split("-")[0] == "english", tydiqa_sec_vld_ds)
    template_collection = TemplateCollection()
    tydiqa_sec_tmpls = template_collection.get_dataset(ds_key, sub_key)
    tmpl = tydiqa_sec_tmpls["simple_question_reading_comp_2"]
This is the same prompt template as evaluation.tasks.tydiqa_secondary.TyDiQADataset.
    prompt, _ = tmpl.apply(removeHyphen(next(tydiqa_sec_vld_ds_en)))
The return value is actually a list; if the template didn't apply, there will be no second element (the expected answer/target). Although we only do removeHyphen() here, promptsource has some more preprocessing for classification, see https://github.com/bigscience-workshop/promptsource/blob/main/promptsource/seqio_tasks/tasks.py
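Given that contract, unpacking `tmpl.apply(...)` with `prompt, _ = ...` raises a ValueError whenever the template did not apply. A defensive sketch of the unpacking (unpack_applied is a hypothetical helper, not a promptsource API):

```python
def unpack_applied(applied):
    """Return (prompt, target) from a template-apply result.

    promptsource's Template.apply returns a list: [prompt, target] when the
    template applied, or just [prompt] when it did not. target is None in
    the latter case, so callers never hit an unpacking error.
    """
    if len(applied) == 2:
        return applied[0], applied[1]
    return applied[0], None

# Template applied: both prompt and target are present.
prompt, target = unpack_applied(["What is X?", "42"])

# Template did not apply: only the prompt came back.
prompt2, target2 = unpack_applied(["What is Y?"])
```

Callers can then skip examples where the target is None instead of crashing mid-stream.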
Force-pushed from 5cd09aa to b9ae559
@@ -19,6 +19,8 @@ tensorflow = "2.5.0"
 torch = "1.9.0"
 tqdm = "4.62.0"
 transformers = "4.9.1"
+promptsource = {git = "https://[email protected]/bigscience-workshop/promptsource.git", rev = "main"}
+aiohttp = "^3.7.4"
Same as datasets[streaming], but we may want to control the version of aiohttp separately, just in case.
@@ -4,3 +4,5 @@ tensorflow==2.5.0
 torch==1.9.0
 tqdm==4.62.0
 transformers==4.9.1
+promptsource @ git+https://[email protected]/bigscience-workshop/promptsource.git@main
+aiohttp==3.7.4
A simple proposal to use promptsource directly, so that we don't have to implement prompt templating from scratch.