Introduce a way to mix generated datasets before sending to training #163
Conversation
Force-pushed 1028021 to cf42ea3
The e2e test finishes synthetic data generation now, although it fails during training because of:
This pull request has merge conflicts that must be resolved before it can be merged.
A bunch of comments all over the place, not terribly useful I'm afraid. Will try to follow up with some higher-level comments
src/instructlab/sdg/generate_data.py (outdated)

skills_phase_data.to_json(skills_fpath, orient="records", lines=True)

knowledge_recipe.add_dataset(knowledge_fpath)
skills_recipe.add_dataset(skills_fpath)
Is the roundtrip to disk necessary? i.e. we're writing the dataset to a file just for it to be loaded again shortly after?
The benefit is that we have a record of this intermediate stage? Or reduced memory usage?
My understanding is that it's so we have an artifact trail of the raw generated data as well as the mixed data, and when #185 goes in, the recipe files will reference this raw generated data to show what was mixed into the overall mixed dataset. The mixing process can reduce the number of samples from a given taxonomy leaf in the overall mixed dataset, so these intermediate files also allow the entire generated output for each leaf node to be retained as an artifact.
Reduced memory usage is the other side effect of this, since we're only ever keeping the dataset for a single leaf node in memory until the mixing process, when we sample some number/percent of the overall generated samples for each leaf instead of loading the entirety of all leaf node generated samples at once.
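
To make that round trip concrete, here is an illustrative sketch (not the PR's exact code) of reloading one leaf node's persisted jsonl and keeping only a sampled portion of it at mix time. The function name, path argument, and the 0.3 ratio are made up for the example; it assumes the Hugging Face datasets API that generate_data.py already uses.

# Illustrative sketch only -- not the PR's implementation. Assumes the
# Hugging Face `datasets` API already used in generate_data.py; the function
# name and the default ratio are invented for this example.
from datasets import load_dataset

def reload_leaf_sample(leaf_fpath: str, ratio: float = 0.3):
    # The full per-leaf jsonl stays on disk as an artifact; only the sampled
    # portion is brought back into memory for mixing.
    leaf_ds = load_dataset("json", data_files=leaf_fpath, split="train")
    keep = int(len(leaf_ds) * ratio)
    return leaf_ds.shuffle(seed=42).select(range(keep))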
I think a dev-doc is going to be the quickest way for me to get comfortable with this - if we merged it in anything like its current form, I feel like a bunch of follow-up bug-fixing and refactoring will be needed ... and I know I wouldn't feel like I could do that safely without the kind of additional context I'd expect in a dev-doc.

I think there's probably also an opportunity to pull stuff out to get to a more KISS starting point (that could be merged more quickly) - e.g. if we configured the initial datasets in the pipeline configs, and omitted the writing of intermediate datasets and recipe files to disk, we wouldn't need a whole new recipe file format?
This pull request has merge conflicts that must be resolved before it can be merged.
I think I need to see the e2e run-through to understand what's going on here.
src/instructlab/sdg/generate_data.py (outdated)

if is_knowledge:
    knowledge_phase_data = _create_phase07_ds(logger, new_generated_data)
    output_file_leaf_knowledge = (
        f"node_datasets_{date_suffix}/node_{i}_p07.jsonl"
I'm missing context here, but I'd love to know the significance of p07 and p10 below.
I don't have full context on why they're named this, but I did push some additional changes that attempt to document what's in these p07 and p10 datasets at least.
E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run
This makes more sense if we want to use PipelineContext to separately initialize the mmlu-bench pipeline. Suggestion from @bbrowning in instructlab#163. Signed-off-by: Mark McLoughlin <[email protected]>
I think I've addressed most of the initial feedback, although I'm still waiting on a couple of answers from others to ensure I've documented things properly and didn't prune too much out in my attempts to streamline this. I've also parked a copy of my fork's branch at https://github.com/bbrowning/instructlab-sdg/commits/data-mixing-full-history/ up to this point for reference to myself or anyone else who works on follow-up PRs to add back changes that I'm pruning out of scope for this one.

Now, on to the squash/rebase phase to get the list of commits down to a reasonable number and ensure authorship is correctly attributed. Once that's done, I think this will be in decent shape for additional review and merging consideration, with the understanding that it's not perfect but has to get in under specific time constraints and is a prerequisite for some smaller follow-up PRs around recipe files, system prompts, and removal of legacy train/messages formats, as well as overall integration testing end-to-end.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed e49a1e0 to 9d3dd74
Ok, I feel like this is at a point where it's good enough to review. I'll be ready to address any concerns quickly to keep the round-trip time of that whole process as short as possible. After merging we'll need some follow-up PRs on top of this work (and tracked in #162) to write out recipe yaml files, add auxiliary datasets, add precomputed datasets, and remove the legacy train/messages json output formats.
src/instructlab/sdg/generate_data.py (outdated)

return knowledge_ds


def _build_raft_dataset(ds: Dataset, p, num_doc_in_context=4):
ChatGPT offers 2 possible explanations for what "raft" could mean here:
Random Augmented Feature Training (RAFT): In machine learning, "RAFT" might stand for Random Augmented Feature Training, a technique where random augmentation is applied to the features of a dataset to improve the model's robustness and performance. The function's description suggests that additional context from random samples is added, which could be a form of data augmentation.
Reference Augmented Fine-Tuning (RAFT): In NLP, RAFT might refer to adding additional context or references to enhance the training data. By incorporating context from other samples, the function could be aiming to provide a more comprehensive background, potentially improving the quality of responses generated by a language model.
Seems like it's the latter? Expanding the acronym and linking to https://aka.ms/raft-paper would help
I do think this is in the spirit of that RAFT paper, but I'll give this a different name that doesn't require understanding the acronym - something like _add_extra_contexts_to_samples, since it's adding extra contexts to each sample in the dataset.
See also
metadata["dataset"] += f"_raft_p{p}"
Even expanding the acronym in the docstring to explain that naming would be useful 👍
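
For readers without the paper handy, here is a rough sketch of the RAFT-style idea being discussed: padding each sample's own context with contexts borrowed from other random samples. The "context" column name, treating p as a keep-the-golden-context probability, and joining contexts with newlines are assumptions for illustration, not the PR's actual logic.

# Rough sketch of the RAFT-style augmentation discussed above, not the PR's code.
# Assumes a "context" column and treats p as the probability of keeping the
# sample's own (golden) context alongside distractor contexts from other rows.
import random
from datasets import Dataset

def add_extra_contexts_to_samples(ds: Dataset, p: float, num_doc_in_context: int = 4) -> Dataset:
    rng = random.Random(42)
    all_contexts = ds["context"]

    def _expand(sample, idx):
        # with probability p, keep this sample's own context in the mix
        contexts = [sample["context"]] if rng.random() < p else []
        # fill the remaining slots with contexts drawn from other samples
        other_idxs = [j for j in range(len(all_contexts)) if j != idx]
        picks = rng.sample(other_idxs, min(num_doc_in_context - len(contexts), len(other_idxs)))
        contexts.extend(all_contexts[j] for j in picks)
        rng.shuffle(contexts)
        sample["context"] = "\n".join(contexts)
        return sample

    return ds.map(_expand, with_indices=True)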
Looking into:
I think it's this:
I hit this error locally when testing this a minute ago, and making the change above fixes things on my local machine.
Rebased from aakankshaduggal#4 aakankshaduggal#13 aakankshaduggal#14

Refactored and changed by me to be as least disruptive as possible in merging in, which means leaving the legacy train and messages outputs in place in generated_data.py.

This introduces a new Recipe class that references multiple generated datasets and has logic to combine those into a single mixed dataset based on a configurable number / ratio of samples to take from each dataset.

This adds new output artifacts for each data generation run:

* a `node_datasets_*` subfolder that contains the raw generated samples from each taxonomy leaf node as a set of jsonl files - within here, each skill taxonomy node gets a single jsonl and knowledge nodes get two jsonl files used in separate downstream steps of training
* a `skills_train_msgs_*.jsonl` and a `knowledge_train_msgs_*.jsonl` file that contains mixtures of the above raw generated samples based on proportions specified during the mixing process

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
I'm good with merging this as-is when CI passes. I think the cleanups, documentation, refactoring, scope reduction, etc. have gotten this to a place where it's much more maintainable going forward. Thanks, Ben!
dataset = dataset.map(_move_unallowed_cols_to_metadata, num_proc=num_proc)

# check if metadata column is string if not convert it using json.dumps
Hard to see what situation this code is trying to handle? Seems like metadata can be a dict somehow?
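
One plausible reading of that comment, sketched under assumptions (a "metadata" column, and upstream steps possibly leaving it as a dict): normalize the column to JSON strings before mixing so every row has a consistent type.

# Sketch of the kind of normalization the comment seems to describe, not the
# actual code: if the metadata column holds dicts rather than strings,
# serialize each value with json.dumps so the column has one consistent type.
import json
from datasets import Dataset

def ensure_metadata_is_string(dataset: Dataset) -> Dataset:
    if len(dataset) == 0 or isinstance(dataset[0].get("metadata"), str):
        return dataset
    return dataset.map(lambda sample: {"metadata": json.dumps(sample["metadata"])})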
Ok, pushing the big green merge button on this for now as it's a prerequisite for a number of other PRs. Thanks for all the great feedback, and I'll keep cleaning this up with the work that builds on top of this.
See #162
Rebased from aakankshaduggal#4, aakankshaduggal#13, and aakankshaduggal#14, reducing scope and splitting out what we can into subsequent follow-up PRs for the overall data mixing functionality instead of landing everything as one big PR.
Refactored and changed by me to be as least disruptive as possible in merging in, which means leaving the legacy train and messages outputs in place in generated_data.py.
This introduces a new Recipe class that references multiple generated datasets and has logic to combine those into a single mixed dataset based on a configurable number / ratio of samples to take from each dataset.
This adds new output artifacts for each data generation run:

* a `node_datasets_*` subfolder that contains the raw generated samples from each taxonomy leaf node as a set of jsonl files - within here, each skill taxonomy node gets a single jsonl and knowledge nodes get two jsonl files used in separate downstream steps of training
* a `skills_train_msgs_*.jsonl` and a `knowledge_train_msgs_*.jsonl` file that contains mixtures of the above raw generated samples based on proportions specified during the mixing process
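
To picture the Recipe described above, here is a hypothetical sketch of its general shape - it records generated dataset paths with a sampling size or ratio for each, then combines them into one mixed jsonl. Only add_dataset appears in the diff snippets above; the other method names, the seed, and the float-vs-int handling are guesses, not the merged implementation.

# Hypothetical sketch of the Recipe idea, not the class as merged in this PR.
from datasets import load_dataset, concatenate_datasets

class Recipe:
    def __init__(self):
        # each entry is (path_to_generated_jsonl, sampling_size)
        self.datasets = []

    def add_dataset(self, path, sampling_size=1.0):
        # sampling_size: a 0-1 ratio of samples to keep, or an absolute count
        self.datasets.append((path, sampling_size))

    def _sample(self, ds, sampling_size):
        num = int(len(ds) * sampling_size) if isinstance(sampling_size, float) else min(sampling_size, len(ds))
        return ds.shuffle(seed=42).select(range(num))

    def save_mixed_dataset(self, output_path):
        # reload each per-leaf jsonl, keep its configured share, and write one
        # shuffled training mixture (e.g. a skills_train_msgs_*.jsonl file)
        parts = [
            self._sample(load_dataset("json", data_files=path, split="train"), size)
            for path, size in self.datasets
        ]
        concatenate_datasets(parts).shuffle(seed=42).to_json(output_path, orient="records", lines=True)

The knowledge_recipe.add_dataset(knowledge_fpath) and skills_recipe.add_dataset(skills_fpath) calls in the earlier diff snippet fit this general shape.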