
Add support for auxiliary dataset generation #204

Merged
1 commit merged into instructlab:main on Jul 29, 2024

Conversation

bbrowning
Contributor

This adds support for generating auxiliary datasets during knowledge data generation. An auxiliary dataset is one where we ask the model to generate additional data samples using a different prompt than the standard dataset, along with extra instruction prompts that get matched to the auxiliary generated samples and used during training.

Parts of this are extracted and rebased from
aakankshaduggal#4
aakankshaduggal#21

Refs #162.
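
To make the shape of that pairing concrete, here is a minimal illustrative sketch (plain Python, not the project's actual API; the spellcheck key, instruction strings, and field names are taken from the example config shown later in this thread):

    import random

    # Hypothetical auxiliary instructions: a generated dataset type
    # ("spellcheck") mapped to the instruction prompts that get paired
    # with its generated samples when mixing training data.
    auxiliary_instructions = {
        "spellcheck": [
            "Correct any spelling errors in the document and output the corrected version.",
            "Rewrite the document to remove any spelling errors.",
        ],
    }

    # A generated auxiliary sample (illustrative field names only).
    sample = {
        "dataset_type": "spellcheck",
        "document": "Teh quick brown fox.",
        "corrected_document": "The quick brown fox.",
    }

    # Pair the sample with one of the matching instruction prompts.
    instruction = random.choice(auxiliary_instructions[sample["dataset_type"]])
    print(instruction)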

@mergify mergify bot added the testing Relates to testing label Jul 25, 2024
@mergify mergify bot added the ci-failure label Jul 25, 2024
@bbrowning
Contributor Author

Checkpointing my work here to let CI chew on this and see if it works on more than just my machine. Any comments and suggestions are welcome, although I'm keeping this marked as draft for now as I'd like to do another pass on this work myself to clean up and document things a bit.

@bbrowning
Contributor Author

Hmm, it looks like we're missing the expanded schema definitions necessary for the new pipeline blocks added in #182 . Created #205 to track this, as it's a bit orthogonal to this PR's work.

@bbrowning bbrowning force-pushed the auxiliary-dataset branch from 81a864c to 0edd68a Compare July 25, 2024 01:13
@mergify mergify bot added ci-failure and removed ci-failure labels Jul 25, 2024
@bbrowning
Copy link
Contributor Author

Added a commit to this PR that at least partially fixes #205 for now. It feels wrong adding that to this PR, and it should likely get pulled out into its own PR, perhaps with an accompanying test that hands a basic YAML block definition to the schema validation for each block type, just to ensure the baseline expected block configuration is covered by the schemas. I know we have tox -e validate-pipelines, but that only validates the committed pipeline definitions we ship by default, as opposed to the universe of possible or expected upstream and downstream pipeline definitions.

@mergify mergify bot added the ci-failure label Jul 25, 2024
@markmc
Contributor

markmc commented Jul 25, 2024

Added a commit to this PR that at least partially fixes #205 for now. It feels wrong adding that to this PR, and it should likely get pulled out into its own PR

done in #206

perhaps with an accompanying test that hands a basic YAML block definition to the schema validation for each block type, just to ensure the baseline expected block configuration is covered by the schemas. I know we have tox -e validate-pipelines, but that only validates the committed pipeline definitions we ship by default, as opposed to the universe of possible or expected upstream and downstream pipeline definitions.

Filed #207

Thanks!

Contributor

mergify bot commented Jul 25, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

    object if found or None if the instructions yaml file does not exist.
    """
    auxiliary_path = resources.files(__package__).joinpath(
        "configs/knowledge/auxiliary_instructions.yaml"
Contributor


I think everything else in this "configs" dir is a prompt template? Can we move this one into a new dir?

Contributor Author


This is mirroring where these files are already placed by some of our downstream users, so it could be moved, but we'll need to coordinate with them to ensure the downstream auxiliary instructions get moved as well.

Contributor


Yes please

Contributor Author


Naming is hard for me, but I took a first stab at this by moving this from configs/ to instructions/auxiliary_knowledge.yaml.

Contributor

mergify bot commented Jul 25, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@bbrowning bbrowning force-pushed the auxiliary-dataset branch from 24a43a5 to d4e79ee Compare July 25, 2024 22:43
@mergify mergify bot removed the needs-rebase label Jul 25, 2024
@bbrowning bbrowning force-pushed the auxiliary-dataset branch from 4e47773 to bccb44c Compare July 26, 2024 01:28
@mergify mergify bot removed the needs-rebase label Jul 26, 2024
@markmc markmc added this to the 0.2.2 milestone Jul 26, 2024
@markmc
Contributor

markmc commented Jul 26, 2024

Comment from me on slack:

It seems to me there is some sort of connection between the pipeline config and the instructions ... a pipeline author would write those two things together ... so I'd be in favor of the pipeline config providing the instructions location somehow

@danmcp
Member

danmcp commented Jul 28, 2024

@bbrowning What's left on your list before this can be moved out of draft state?

@markmc markmc modified the milestones: 0.2.2, 0.2.3 Jul 28, 2024
@markmc
Contributor

markmc commented Jul 29, 2024

@bbrowning What's left on your list before this can be moved out of draft state?

One remaining issue is this:

    auxiliary_path = resources.files(__package__).joinpath(
        "instructions/auxiliary_knowledge.yaml"
    )

i.e. the only location the code can load instructions from is this path in the Python package

If downstream custom pipeline configs need different instructions, then we need to be able to load them from the same place the custom pipeline configs are installed. This is how we find pipeline configs:

    pd = platformdirs.PlatformDirs(
        appname=os.path.join("instructlab", "sdg"), multipath=True
    )
    for d in pd.iter_data_dirs():
        pipeline_path = os.path.join(d, "pipelines", pipeline)
        if os.path.exists(pipeline_path):

We also load default_recipes from this system directory:

        for d in self.data_dirs:
            default_recipe_path = os.path.join(d, "default_data_recipes", yaml_basename)
            if os.path.exists(default_recipe_path):

so the most obvious thing would be to load from e.g. /usr/share/instructlab/sdg/auxiliary_instructions/knowledge.yaml
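
For illustration, here is a minimal sketch of that lookup, mirroring the pipeline-config search quoted above (this is a proposal, not existing code; the auxiliary_instructions subdirectory name is just the suggested location):

    import os

    import platformdirs

    # Search the same data dirs used for pipeline configs
    # (e.g. /usr/share/instructlab/sdg) for auxiliary instructions.
    pd = platformdirs.PlatformDirs(
        appname=os.path.join("instructlab", "sdg"), multipath=True
    )

    auxiliary_path = None
    for d in pd.iter_data_dirs():
        candidate = os.path.join(d, "auxiliary_instructions", "knowledge.yaml")
        if os.path.exists(candidate):
            auxiliary_path = candidate
            break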

I still feel this is suboptimal, and we'll probably want to change this in the future so that the instructions location is a relative path specified in the pipeline config

@markmc
Contributor

markmc commented Jul 29, 2024

so the most obvious thing would be to load from e.g. /usr/share/instructlab/sdg/auxiliary_instructions/knowledge.yaml

I started implementing this, but in the case where downstream custom pipelines are installed, we need to know to use the instructions in the Python package with the full pipeline, and the appropriate instructions from /usr/share with the downstream custom pipeline. And it's plausible we could have multiple downstream pipelines, each with different instructions.

We definitely need a way to link the pipeline config with the instructions.

@markmc
Contributor

markmc commented Jul 29, 2024

We definitely need a way to link the pipeline config with the instructions.

824e163 adds the auxiliary instructions to the pipeline config

Contributor

mergify bot commented Jul 29, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 29, 2024
@markmc markmc force-pushed the auxiliary-dataset branch from 824e163 to 658d7f3 Compare July 29, 2024 14:55
@mergify mergify bot added ci-failure and removed needs-rebase labels Jul 29, 2024
@markmc markmc force-pushed the auxiliary-dataset branch from 658d7f3 to cf30b3e Compare July 29, 2024 15:49
@mergify mergify bot removed the ci-failure label Jul 29, 2024
@markmc
Copy link
Contributor

markmc commented Jul 29, 2024

Fixed this e2e failure:

  File "/home/runner/work/sdg/sdg/venv/lib/python3.11/site-packages/instructlab/data/generate.py", line 236, in generate
    generate_data(
  File "/home/runner/work/sdg/sdg/venv/lib/python3.11/site-packages/instructlab/sdg/generate_data.py", line 371, in generate_data
    mixer = _mixer_init(ctx, output_dir, date_suffix, sdg_knowledge.auxiliary_inst)
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'SDG' object has no attribute 'auxiliary_inst'

src/instructlab/sdg/datamixing.py (review thread resolved)
src/instructlab/sdg/datamixing.py (outdated review thread resolved)
config:
  columns_map:
    document: raw_document
    corrected_document: document
Contributor


This makes me nervous - columns_map is a dict ... what's guaranteeing the ordering here?

Contributor


In the datasets library:

        def rename(columns):
            return [column_mapping[col] if col in column_mapping else col for col in columns]

i.e. it's applying these in the order the columns appear in the dataset

Ok, at least that's deterministic

Contributor Author


Agreed, and good catch - that seems quite fragile, and this config should probably be refactored to be an array? Or split out into two steps, to remove any doubt about the order in which the renaming is applied.

Contributor Author


I think this could be done after merging this larger PR, since the order happens to be deterministic. Especially because it may warrant a second look at the API exposed by RenameColumnsBlock.
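
For reference, a small standalone sketch (plain Python, not the datasets library itself) of why the chained mapping above is deterministic and non-clobbering:

    # The mapping from the pipeline config above: "document" becomes
    # "raw_document" and "corrected_document" becomes "document".
    column_mapping = {"document": "raw_document", "corrected_document": "document"}

    def rename(columns):
        # Same logic as the datasets-library snippet quoted above: each
        # original column name is looked up independently, so the two
        # renames cannot clobber one another.
        return [column_mapping.get(col, col) for col in columns]

    print(rename(["document", "corrected_document", "response"]))
    # -> ['raw_document', 'document', 'response']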

@mergify mergify bot added the ci-failure label Jul 29, 2024
@markmc markmc force-pushed the auxiliary-dataset branch from cf30b3e to d312f7f Compare July 29, 2024 15:55
@mergify mergify bot removed the ci-failure label Jul 29, 2024
This adds support for generating auxiliary datasets during knowledge
data generation. An auxiliary dataset is one where we ask the model to
generate additional data samples using a different prompt than the
standard dataset, along with extra instruction prompts that get matched
to the auxiliary generated samples and used during training.

The auxiliary instructions are a new part of the pipeline config, as
they are tightly coupled to it. In the example below, note that the
`spellcheck` value has to match across both the pipeline blocks and the
new auxiliary instructions, so we just list both in the same config
file:

version: "1.0"
blocks:
...
  - name: flatten_auxiliary_columns
    type: FlattenColumnsBlock
    config:
      var_cols:
        - spellcheck
        - base_document
      value_name: corrected_document
      var_name: dataset_type
...
datamixing:
  auxiliary_instructions:
    spellcheck:
      - Correct any spelling errors in the document and output the corrected version.
      - Rewrite the document to remove any spelling errors.

Parts of this are extracted and rebased from
aakankshaduggal#4
aakankshaduggal#21

Refs instructlab#162.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
@bbrowning bbrowning force-pushed the auxiliary-dataset branch from d312f7f to 4ccdc30 Compare July 29, 2024 17:57
@bbrowning bbrowning marked this pull request as ready for review July 29, 2024 17:59
@bbrowning
Contributor Author

bbrowning commented Jul 29, 2024

Rebased and squashed, unmarked as draft to keep this moving along.

Member

@danmcp danmcp left a comment


Generally LGTM. Adding more test cases to exercise _create_auxiliary_dataset would be a good future addition.

@markmc markmc merged commit 2a91e7c into instructlab:main Jul 29, 2024
11 checks passed