
Add support for auxiliary dataset generation #204

Merged
1 commit merged into instructlab:main on Jul 29, 2024

Conversation

bbrowning
Contributor

This adds support for generating auxiliary datasets during knowledge data generation. An auxiliary dataset is one where we ask the model to generate additional data samples using a different prompt than the standard dataset, along with extra instruction prompts that get matched to the auxiliary generated samples and used during training.

Parts of this are extracted and rebased from
aakankshaduggal#4
aakankshaduggal#21

Refs #162.
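
To make the shape of that pairing concrete, here is a minimal illustrative sketch (plain Python, not the project's actual API; the spellcheck key, instruction strings, and field names are taken from the example config shown later in this thread):

    import random

    # Hypothetical auxiliary instructions: a generated dataset type
    # ("spellcheck") mapped to the instruction prompts that get paired
    # with its generated samples when mixing training data.
    auxiliary_instructions = {
        "spellcheck": [
            "Correct any spelling errors in the document and output the corrected version.",
            "Rewrite the document to remove any spelling errors.",
        ],
    }

    # A generated auxiliary sample (illustrative field names only).
    sample = {
        "dataset_type": "spellcheck",
        "document": "Teh quick brown fox.",
        "corrected_document": "The quick brown fox.",
    }

    # Pair the sample with one of the matching instruction prompts.
    instruction = random.choice(auxiliary_instructions[sample["dataset_type"]])
    print(instruction)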

@mergify mergify bot added the testing Relates to testing label Jul 25, 2024
@mergify mergify bot added the ci-failure label Jul 25, 2024
@bbrowning
Contributor Author

Checkpointing my work here to let CI chew on this and see if it works on more than just my machine. Any comments and suggestions are welcome, although I'm keeping this marked as draft for now as I'd like to do another pass on this work myself to clean up and document things a bit.

@bbrowning
Contributor Author

Hmm, it looks like we're missing the expanded schema definitions necessary for the new pipeline blocks added in #182 . Created #205 to track this, as it's a bit orthogonal to this PR's work.

@bbrowning bbrowning force-pushed the auxiliary-dataset branch from 81a864c to 0edd68a Compare July 25, 2024 01:13
@mergify mergify bot added ci-failure and removed ci-failure labels Jul 25, 2024
@bbrowning
Copy link
Contributor Author

Added a commit to this PR that at least partially fixes #205 for now. It feels wrong adding that to this PR, and it should likely get pulled out into its own PR, perhaps with an accompanying test that hands a basic YAML block definition to the schema validation for each block type, just to ensure the baseline expected block configuration is covered by the schemas. I know we have tox -e validate-pipelines, but that only validates the committed pipeline definitions we ship by default, as opposed to the universe of possible or expected upstream and downstream pipeline definitions.

@mergify mergify bot added the ci-failure label Jul 25, 2024
@markmc
Contributor

markmc commented Jul 25, 2024

Added a commit to this PR that at least partially fixes #205 for now. It feels wrong adding that to this PR, and it should likely get pulled out into its own PR

done in #206

perhaps with an accompanying test that hands a basic YAML block definition to the schema validation for each block type, just to ensure the baseline expected block configuration is covered by the schemas. I know we have tox -e validate-pipelines, but that only validates the committed pipeline definitions we ship by default, as opposed to the universe of possible or expected upstream and downstream pipeline definitions.

Filed #207

Thanks!

Contributor

mergify bot commented Jul 25, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

    object if found or None if the instructions yaml file does not exist.
    """
    auxiliary_path = resources.files(__package__).joinpath(
        "configs/knowledge/auxiliary_instructions.yaml"
Contributor


I think everything else in this "configs" dir is a prompt template? Can we move this one into a new dir?

Contributor Author


This is mirroring where these files are already placed by some of our downstream users, so it could be moved, but we'll need to coordinate with them to ensure the downstream auxiliary instructions get moved as well.

Contributor


Yes please

Contributor Author


Naming is hard for me, but I took a first stab at this by moving this from configs/ to instructions/auxiliary_knowledge.yaml.

Contributor

mergify bot commented Jul 25, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@bbrowning bbrowning force-pushed the auxiliary-dataset branch from 24a43a5 to d4e79ee Compare July 25, 2024 22:43
@mergify mergify bot removed the needs-rebase label Jul 25, 2024
@bbrowning bbrowning force-pushed the auxiliary-dataset branch from 4e47773 to bccb44c Compare July 26, 2024 01:28
@mergify mergify bot removed the needs-rebase label Jul 26, 2024
@markmc markmc added this to the 0.2.2 milestone Jul 26, 2024
@markmc
Contributor

markmc commented Jul 26, 2024

Comment from me on slack:

It seems to me there is some sort of connection between the pipeline config and the instructions ... a pipeline author would write those two things together ... so I'd be in favor of the pipeline config providing the instructions location somehow

@danmcp
Member

danmcp commented Jul 28, 2024

@bbrowning What's left on your list before this can be moved out of draft state?

@markmc markmc modified the milestones: 0.2.2, 0.2.3 Jul 28, 2024
@markmc
Contributor

markmc commented Jul 29, 2024

@bbrowning What's left on your list before this can be moved out of draft state?

One remaining issue is this:

    auxiliary_path = resources.files(__package__).joinpath(
        "instructions/auxiliary_knowledge.yaml"
    )

i.e. the only location the code can load instructions from is this path in the Python package

If downstream custom pipeline configs need different instructions, then we need to be able to load them from the same place the custom pipeline configs are installed. This is how we find pipeline configs:

    pd = platformdirs.PlatformDirs(
        appname=os.path.join("instructlab", "sdg"), multipath=True
    )
    for d in pd.iter_data_dirs():
        pipeline_path = os.path.join(d, "pipelines", pipeline)
        if os.path.exists(pipeline_path):

We also load default_recipes from this system directory:

        for d in self.data_dirs:
            default_recipe_path = os.path.join(d, "default_data_recipes", yaml_basename)
            if os.path.exists(default_recipe_path):

so the most obvious thing would be to load from e.g. /usr/share/instructlab/sdg/auxiliary_instructions/knowledge.yaml
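
For illustration, here is a minimal sketch of that lookup, mirroring the pipeline-config search quoted above (this is a proposal, not existing code; the auxiliary_instructions subdirectory name is just the suggested location):

    import os

    import platformdirs

    # Search the same data dirs used for pipeline configs
    # (e.g. /usr/share/instructlab/sdg) for auxiliary instructions.
    pd = platformdirs.PlatformDirs(
        appname=os.path.join("instructlab", "sdg"), multipath=True
    )

    auxiliary_path = None
    for d in pd.iter_data_dirs():
        candidate = os.path.join(d, "auxiliary_instructions", "knowledge.yaml")
        if os.path.exists(candidate):
            auxiliary_path = candidate
            break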

I still feel this is suboptimal, and we'll probably want to change this in the future so that the instructions location is a relative path specified in the pipeline config

@markmc
Contributor

markmc commented Jul 29, 2024

so the most obvious thing would be to load from e.g. /usr/share/instructlab/sdg/auxiliary_instructions/knowledge.yaml

I started implementing this, but in the case where downstream custom pipelines are installed, we need to know to use the instructions in the Python package with the full pipeline, and the appropriate instructions from /usr/share with the downstream custom pipeline. And it's plausible we could have multiple downstream pipelines, each with different instructions.

We definitely need a way to link the pipeline config with the instructions.

@markmc
Contributor

markmc commented Jul 29, 2024

We definitely need a way to link the pipeline config with the instructions.

824e163 adds the auxiliary instructions to the pipeline config

Contributor

mergify bot commented Jul 29, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 29, 2024
@markmc markmc force-pushed the auxiliary-dataset branch from 824e163 to 658d7f3 Compare July 29, 2024 14:55
@mergify mergify bot added ci-failure and removed needs-rebase labels Jul 29, 2024
@markmc markmc force-pushed the auxiliary-dataset branch from 658d7f3 to cf30b3e Compare July 29, 2024 15:49
@mergify mergify bot removed the ci-failure label Jul 29, 2024
@markmc
Copy link
Contributor

markmc commented Jul 29, 2024

Fixed this e2e failure:

  File "/home/runner/work/sdg/sdg/venv/lib/python3.11/site-packages/instructlab/data/generate.py", line 236, in generate
    generate_data(
  File "/home/runner/work/sdg/sdg/venv/lib/python3.11/site-packages/instructlab/sdg/generate_data.py", line 371, in generate_data
    mixer = _mixer_init(ctx, output_dir, date_suffix, sdg_knowledge.auxiliary_inst)
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'SDG' object has no attribute 'auxiliary_inst'

src/instructlab/sdg/datamixing.py (review thread resolved)
src/instructlab/sdg/datamixing.py (outdated review thread resolved)
config:
  columns_map:
    document: raw_document
    corrected_document: document
Contributor


This makes me nervous - columns_map is a dict ... what's guaranteeing the ordering here?

Contributor


In the datasets library:

        def rename(columns):
            return [column_mapping[col] if col in column_mapping else col for col in columns]

i.e. it's applying these in the order the columns appear in the dataset

Ok, at least that's deterministic

Contributor Author


Agreed, and good catch - that seems quite fragile, and this config should probably be refactored to be an array? Or split out into two steps, to remove any doubt about the order in which the renaming is applied.

Contributor Author


I think this could be done after merging this larger PR, since the order happens to be deterministic. Especially because it may warrant a second look at the API exposed by RenameColumnsBlock.
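
For reference, a small standalone sketch (plain Python, not the datasets library itself) of why the chained mapping above is deterministic and non-clobbering:

    # The mapping from the pipeline config above: "document" becomes
    # "raw_document" and "corrected_document" becomes "document".
    column_mapping = {"document": "raw_document", "corrected_document": "document"}

    def rename(columns):
        # Same logic as the datasets-library snippet quoted above: each
        # original column name is looked up independently, so the two
        # renames cannot clobber one another.
        return [column_mapping.get(col, col) for col in columns]

    print(rename(["document", "corrected_document", "response"]))
    # -> ['raw_document', 'document', 'response']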

@mergify mergify bot added the ci-failure label Jul 29, 2024
@markmc markmc force-pushed the auxiliary-dataset branch from cf30b3e to d312f7f Compare July 29, 2024 15:55
@mergify mergify bot removed the ci-failure label Jul 29, 2024
This adds support for generating auxiliary datasets during knowledge
data generation. An auxiliary dataset is one where we ask the model to
generate additional data samples using a different prompt than the
standard dataset, along with extra instruction prompts that get matched
to the auxiliary generated samples and used during training.

The auxiliary instructions are a new part of the pipeline config, as
they are tightly coupled to it. In the example below, note that the
`spellcheck` value has to match across both the pipeline blocks and the
new auxiliary instructions, so we just list both in the same config
file:

version: "1.0"
blocks:
...
  - name: flatten_auxiliary_columns
    type: FlattenColumnsBlock
    config:
      var_cols:
        - spellcheck
        - base_document
      value_name: corrected_document
      var_name: dataset_type
...
datamixing:
  auxiliary_instructions:
    spellcheck:
      - Correct any spelling errors in the document and output the corrected version.
      - Rewrite the document to remove any spelling errors.

Parts of this are extracted and rebased from
aakankshaduggal#4
aakankshaduggal#21

Refs instructlab#162.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
@bbrowning bbrowning force-pushed the auxiliary-dataset branch from d312f7f to 4ccdc30 Compare July 29, 2024 17:57
@bbrowning bbrowning marked this pull request as ready for review July 29, 2024 17:59
@bbrowning
Contributor Author

bbrowning commented Jul 29, 2024

Rebased and squashed, unmarked as draft to keep this moving along.

Member

@danmcp danmcp left a comment


Generally LGTM. Adding more test cases to exercise _create_auxiliary_dataset would be a good future addition.

@markmc markmc merged commit 2a91e7c into instructlab:main Jul 29, 2024
11 checks passed