Add support for auxiliary dataset generation #204
Conversation
Checkpointing my work here to let CI chew on this and see if it works on more than just my machine. Any comments and suggestions are welcome, although I'm keeping this marked as draft for now as I'd like to do another pass on this work myself to clean and document things a bit.
force-pushed from 81a864c to 0edd68a
Added a commit to this PR that at least partially fixes #205 for now. It feels wrong adding that to this PR, and it should likely get pulled out into its own PR, perhaps with an accompanying test that hands a basic yaml block definition to the schema validation for each block type, just to ensure baseline expected block configuration is covered by the schemas. I know we have …
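A sketch of the kind of per-block schema smoke test being suggested - purely illustrative, since the block names, the schema location, and the use of `jsonschema` here are assumptions rather than this repo's actual layout:

```python
# Hypothetical smoke test: validate a minimal YAML definition of each block
# type against the pipeline JSON schema. Block names and the schema path are
# assumptions for illustration.
import json
from importlib import resources

import jsonschema
import pytest
import yaml

BLOCK_TYPES = ["LLMBlock", "FilterByValueBlock", "RenameColumnsBlock"]


@pytest.mark.parametrize("block_type", BLOCK_TYPES)
def test_minimal_block_config_passes_schema(block_type):
    minimal_block = yaml.safe_load(
        f"""
        name: test_{block_type.lower()}
        type: {block_type}
        config: {{}}
        """
    )
    schema_text = (
        resources.files("instructlab.sdg")
        .joinpath("pipelines/schema/v1.json")  # assumed location
        .read_text(encoding="utf-8")
    )
    schema = json.loads(schema_text)
    # Validate a whole minimal pipeline so required top-level keys are present.
    pipeline = {"version": "1.0", "blocks": [minimal_block]}
    jsonschema.validate(instance=pipeline, schema=schema)
```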
done in #206
Filed #207. Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.
force-pushed from d9f607b to 38c35da
src/instructlab/sdg/datamixing.py (outdated):

```python
    object if found or None if the instructions yaml file does not exist.
    """
    auxilary_path = resources.files(__package__).joinpath(
        "configs/knowledge/auxiliary_instructions.yaml"
```
I think everything else in this "configs" dir is a prompt template? Can we move this one into a new dir?
This is mirroring where these files are already placed by some of our downstream users, so it could be moved, but we'll need to coordinate with them to ensure downstream auxiliary instructions get moved as well.
Yes please
Naming is hard for me, but I took a first stab at this by moving this from `configs/` to `instructions/auxiliary_knowledge.yaml`.
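For reference, a loader pointed at the new location might look like this - a sketch mirroring the diff quoted above, with the helper name being my own invention:

```python
# Hypothetical helper mirroring the loader from the diff above, pointed at the
# new instructions/ location. The function name is illustrative only.
from importlib import resources

import yaml


def _load_auxiliary_instructions():
    path = resources.files(__package__).joinpath(
        "instructions/auxiliary_knowledge.yaml"
    )
    if not path.is_file():
        # No instructions shipped; auxiliary dataset generation is skipped.
        return None
    return yaml.safe_load(path.read_text(encoding="utf-8"))
```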
force-pushed from 234144f to 24a43a5
This pull request has merge conflicts that must be resolved before it can be merged.
force-pushed from 24a43a5 to d4e79ee
force-pushed from 4e47773 to bccb44c
Comment from me on slack:

> @bbrowning What's left on your list before this can be moved out of draft state?
One remaining issue is this:
i.e. the only location the code can load instructions from is this path in the Python package. If downstream custom pipeline configs need different instructions, then we need to be able to load them from the same place the custom pipeline configs are installed. This is how we find pipeline configs:
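Roughly, that lookup searches the XDG data dirs before falling back to the package - a sketch, with helper names that are my own assumptions:

```python
# Sketch of the pipeline-config lookup described above: search user and site
# XDG data dirs for an installed pipeline, falling back to package pipelines.
import os

from xdg_base_dirs import xdg_data_dirs, xdg_data_home


def _pipeline_data_dirs():
    dirs = [os.path.join(xdg_data_home(), "instructlab", "sdg")]
    dirs.extend(os.path.join(d, "instructlab", "sdg") for d in xdg_data_dirs())
    return dirs


def find_pipeline_dir(pipeline_name):
    # Return the first matching directory, or None to use package pipelines.
    for d in _pipeline_data_dirs():
        candidate = os.path.join(d, "pipelines", pipeline_name)
        if os.path.exists(candidate):
            return candidate
    return None
```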
We also load … so the most obvious thing would be to load from e.g. … I still feel this is suboptimal, and we'll probably want to change this in the future so the instructions location is a relative path specified in the pipeline config.
I started implementing this, but in the case where downstream custom pipelines are installed, we then need to know to use the instructions in the Python package with the … We definitely need a way to link the pipeline config with the instructions.
824e163 adds the auxiliary instructions to the pipeline config.
This pull request has merge conflicts that must be resolved before it can be merged.
force-pushed from 824e163 to 658d7f3
force-pushed from 658d7f3 to cf30b3e
Fixed this e2e failure: …
```yaml
config:
  columns_map:
    document: raw_document
    corrected_document: document
```
This makes me nervous - `columns_map` is a dict ... what's guaranteeing the ordering here?
In the datasets library:

```python
def rename(columns):
    return [column_mapping[col] if col in column_mapping else col for col in columns]
```
i.e. it's applying these in the order of the columns in the dataset
Ok, at least that's deterministic
Agreed, and good catch - that seems quite fragile, and this config should probably be refactored to be an array? Or this could be split out into two steps, to take out any doubt about the order in which the renaming is applied.
I think this could be done after merging this larger PR, since the order happens to be deterministic. Especially because it may warrant a second look at the API exposed by `RenameColumnsBlock`.
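For illustration, the two-step variant with the datasets library could look like this (a sketch of the suggestion, not code from this PR):

```python
# Two sequential renames leave no doubt about ordering, unlike a single
# mapping whose behavior depends on the dataset's column order.
from datasets import Dataset

ds = Dataset.from_dict(
    {"document": ["teh quick brown fox"], "corrected_document": ["the quick brown fox"]}
)

# Step 1: move the original document out of the way.
ds = ds.rename_columns({"document": "raw_document"})
# Step 2: promote the corrected text into the now-free "document" column.
ds = ds.rename_columns({"corrected_document": "document"})

assert set(ds.column_names) == {"raw_document", "document"}
```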
force-pushed from cf30b3e to d312f7f
This adds support for generating auxiliary datasets during knowledge data generation. An auxiliary dataset is where we ask the model to generate some additional data samples with a different prompt than the standard dataset, along with some extra instruction prompts that will get matched to the auxiliary generated samples and used during training.

The auxiliary instructions are a new part of the pipeline config, as they are tightly coupled to it. An example, where you'll note the `spellcheck` value has to match across both the pipeline config and the new auxiliary instructions, so we just list both in the same config file:

```yaml
version: "1.0"
blocks:
  ...
  - name: flatten_auxiliary_columns
    type: FlattenColumnsBlock
    config:
      var_cols:
        - spellcheck
        - base_document
      value_name: corrected_document
      var_name: dataset_type
  ...
datamixing:
  auxiliary_instructions:
    spellcheck:
      - Correct any spelling errors in the document and output the corrected version.
      - Rewrite the document to remove any spelling errors.
```

Parts of this are extracted and rebased from aakankshaduggal#4 and aakankshaduggal#21.

Refs instructlab#162.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
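As an aside, the `FlattenColumnsBlock` config above reads like pandas' `melt` parameters; a rough sketch of the wide-to-long reshape it describes, under that assumption rather than as the block's actual implementation:

```python
# Illustrative wide-to-long reshape matching the FlattenColumnsBlock config:
# var_cols become rows, their names land in var_name, values in value_name.
import pandas as pd

wide = pd.DataFrame(
    {
        "id": [1],
        "spellcheck": ["the quick brown fox"],
        "base_document": ["teh quick brown fox"],
    }
)

flat = wide.melt(
    id_vars=["id"],
    value_vars=["spellcheck", "base_document"],
    var_name="dataset_type",
    value_name="corrected_document",
)
# flat now has one row per (id, dataset_type) pair.
```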
force-pushed from d312f7f to 4ccdc30
Rebased and squashed, unmarked as draft to keep this moving along.
Generally LGTM. Adding more test cases to exercise `_create_auxiliary_dataset` would be a good future addition.
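For example, a future test might look something like this - the signature and expected columns of `_create_auxiliary_dataset` here are assumptions to be checked against the merged code:

```python
# Hypothetical future test; the assumed signature takes the generated dataset
# plus the auxiliary_instructions mapping from the pipeline config.
from datasets import Dataset

from instructlab.sdg.datamixing import _create_auxiliary_dataset


def test_create_auxiliary_dataset_spellcheck():
    generated = Dataset.from_dict(
        {
            "document": ["teh quick brown fox"],
            "spellcheck": ["the quick brown fox"],
            "dataset_type": ["spellcheck"],
        }
    )
    auxiliary_inst = {
        "spellcheck": ["Correct any spelling errors in the document."]
    }
    aux = _create_auxiliary_dataset(generated, auxiliary_inst)
    assert aux is not None and len(aux) == 1
```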