Introduce a way to mix generated datasets before sending to training #163

Merged: 1 commit merged into instructlab:main from data-mixing on Jul 24, 2024

Conversation

@bbrowning (Contributor) commented Jul 18, 2024

See #162

Rebased from aakankshaduggal#4, aakankshaduggal#13, and aakankshaduggal#14, reducing scope and splitting out what we can into subsequent follow-up PRs for the overall data mixing functionality, instead of landing everything as one big PR.

Refactored and changed by me to be as minimally disruptive as possible to merge in, which means leaving the legacy train and messages outputs in place in generate_data.py.

This introduces a new Recipe class that references multiple generated datasets and has logic to combine those into a single mixed dataset based on a configurable number / ratio of samples to take from each dataset.
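
As a rough sketch of the idea (only the Recipe name and the add_dataset(path, sampling_size) call appear in this PR's diffs; the method bodies and datasets-library usage below are illustrative assumptions, not the PR's actual implementation):

from datasets import Dataset, concatenate_datasets, load_dataset


class Recipe:
    """Collect per-leaf-node jsonl files and mix them into one dataset."""

    def __init__(self):
        # (path, sampling_size) pairs; sampling_size is assumed to be either
        # an absolute number of samples or a 0..1 ratio of that dataset
        self.datasets = []

    def add_dataset(self, path, sampling_size=1.0):
        self.datasets.append((path, sampling_size))

    @staticmethod
    def _sample(ds: Dataset, sampling_size) -> Dataset:
        # Interpret floats as ratios and ints as absolute sample counts
        num = int(len(ds) * sampling_size) if isinstance(sampling_size, float) else sampling_size
        return ds.shuffle(seed=42).select(range(min(num, len(ds))))

    def generate(self, output_path):
        # Load each referenced jsonl one at a time, subsample it, and
        # concatenate the pieces into the final mixed dataset on disk
        mixed = concatenate_datasets(
            [
                self._sample(load_dataset("json", data_files=path, split="train"), size)
                for path, size in self.datasets
            ]
        )
        mixed.to_json(output_path)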

This adds new output artifacts for each data generation run:

  • a node_datasets_* subfolder that contains the raw generated samples from each taxonomy leaf node as a set of jsonl files - within here, each skill taxonomy node gets a single jsonl and knowledge nodes get two jsonl files used in separate downstream steps of training
  • skills_train_msgs_*.jsonl and knowledge_train_msgs_*.jsonl files that contain mixtures of the above raw generated samples, based on proportions specified during the mixing process
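
Put together, the artifacts for one run might be laid out like this (timestamps are examples; the p07/p10 knowledge file names come from the diff hunks discussed below, while the exact skill-node file name is an assumption):

node_datasets_2024-07-18T16_10_49/
    node_1_p07.jsonl    (knowledge node; the p07/p10 pair feeds separate training steps)
    node_1_p10.jsonl
    node_2.jsonl        (skill node; a single jsonl)
skills_train_msgs_2024-07-18T16_10_49.jsonl
knowledge_train_msgs_2024-07-18T16_10_49.jsonl
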
Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>

@bbrowning (Contributor Author)

The e2e test now finishes synthetic data generation, although it fails during training with `/tmp/tmp.JJXU361JVM/.local/share/instructlab/datasets does not contain training or test files, did you run ilab data generate?`. This is because the mixed files do not match the `train_*` or `test_*` prefixes it expects those files to have, and instead have names like skills_train_msgs_2024-07-18T16_10_49.jsonl or knowledge_train_msgs_2024-07-18T16_10_49.jsonl.

@mergify mergify bot added the needs-rebase label Jul 18, 2024

mergify bot commented Jul 18, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@markmc (Contributor) left a comment


A bunch of comments all over the place, not terribly useful I'm afraid. Will try to follow up with some higher-level comments.

src/instructlab/sdg/generate_data.py (outdated, resolved)
src/instructlab/sdg/generate_data.py (outdated, resolved)
src/instructlab/sdg/utils/datamixing.py (outdated, resolved)
MANIFEST.in (outdated, resolved)
src/instructlab/sdg/generate_data.py (outdated, resolved)
skills_phase_data.to_json(skills_fpath, orient="records", lines=True)

knowledge_recipe.add_dataset(knowledge_fpath)
skills_recipe.add_dataset(skills_fpath)
Contributor

Is the roundtrip to disk necessary? i.e. we're writing the dataset to a file just for it to be loaded again shortly after?

The benefit is that we have a record of this intermediate stage? Or reduced memory usage?

Contributor Author

My understanding is that it's so we have an artifact trail of the raw generated data as well as the mixed data; when #185 goes in, the recipe files will reference this raw generated data to show what was mixed into the overall mixed dataset. The mixing process can reduce the number of samples taken from a given taxonomy leaf in the overall mixed dataset, so these intermediate files also allow the entire generated output for each leaf node to be retained as an artifact.

Reduced memory usage is the other side-effect of this, since we only ever keep the dataset for a single leaf node in memory until the mixing process, where we sample some number/percent of the overall generated samples for each leaf instead of loading all leaf nodes' generated samples at once.
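
A minimal, self-contained sketch of that flow, assuming the Hugging Face datasets library (all file and variable names here are hypothetical, not this PR's code):

from datasets import Dataset, load_dataset

leaf_samples = {  # stand-in for per-leaf-node generated data
    "node_0": [{"messages": f"sample {i}"} for i in range(100)],
    "node_1": [{"messages": f"sample {i}"} for i in range(50)],
}

paths = []
for name, samples in leaf_samples.items():
    ds = Dataset.from_list(samples)
    path = f"{name}.jsonl"
    ds.to_json(path)    # artifact trail: full per-leaf output stays on disk
    paths.append(path)  # only the path is kept in memory from here on

# Mix time: load one file at a time and take a fraction of its samples,
# so at most one full leaf-node dataset is ever in memory at once
mixed_parts = []
for path in paths:
    ds = load_dataset("json", data_files=path, split="train")
    mixed_parts.append(ds.shuffle(seed=0).select(range(len(ds) // 2)))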

src/instructlab/sdg/utils/datamixing.py (outdated, resolved)
src/instructlab/sdg/generate_data.py (outdated, resolved)
src/instructlab/sdg/utils/parse_and_convert.py (outdated, resolved)
src/instructlab/sdg/utils/parse_and_convert.py (outdated, resolved)
@markmc (Contributor) commented Jul 19, 2024

I think a dev-doc is going to be the quickest way for me to get comfortable with this. If we merged it in anything like its current form, I feel like a bunch of follow-up bug-fixing and refactoring will be needed, and I know I wouldn't feel like I could do that safely without the kind of additional context I'd expect in a dev-doc.

I think there's probably also an opportunity to pull stuff out and get to a more KISS starting point (that could be merged more quickly) - e.g. if we configured the initial datasets in the pipeline configs, and omitted the writing of intermediate datasets and recipe files to disk, we wouldn't need a whole new recipe file format?

@mergify mergify bot added the needs-rebase label Jul 19, 2024

mergify bot commented Jul 19, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@russellb russellb added this to the 0.2.0 milestone Jul 22, 2024
@mergify mergify bot added ci-failure and removed needs-rebase labels Jul 22, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Jul 22, 2024
@derekhiggins (Contributor) left a comment


I think I need to see the e2e run-through to understand what's going on here.

src/instructlab/sdg/generate_data.py (resolved)
src/instructlab/sdg/generate_data.py (outdated, resolved)
if is_knowledge:
    knowledge_phase_data = _create_phase07_ds(logger, new_generated_data)
    output_file_leaf_knowledge = (
        f"node_datasets_{date_suffix}/node_{i}_p07.jsonl"
Contributor

I'm missing context here but I'd love to know the significance of p07 and p10 below.

Contributor Author

I don't have full context on why they're named this, but I did push some additional changes that attempt to document what's in these p07 and p10 datasets at least.

@markmc markmc modified the milestones: 0.2.0, 0.2.1 Jul 23, 2024
@mergify mergify bot removed the ci-failure label Jul 23, 2024

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

derekhiggins pushed a commit to derekhiggins/sdg that referenced this pull request Jul 23, 2024
This makes more sense if we want to use PipelineContext to separately
initialize the mmlu-bench pipeline.

Suggestion from @bbrowning in instructlab#163.

Signed-off-by: Mark McLoughlin <[email protected]>
@bbrowning (Contributor Author) commented Jul 23, 2024

I think I've addressed most of the initial feedback, although I'm still waiting on a couple of answers from others to ensure I've documented things properly and didn't prune too much out in my attempts to streamline this. I've also parked a copy of my fork's branch at https://github.com/bbrowning/instructlab-sdg/commits/data-mixing-full-history/ up to this point, for reference for myself or anyone else who works on follow-up PRs to add back changes that I'm pruning out of scope for this one.

Now, on to the squash/rebase phase to get the list of commits down to a reasonable number and ensure authorship is correctly attributed. Once that's done, I think this will be in decent shape for additional review and merging consideration, with the understanding that it's not perfect but has to get in under specific time constraints and is a prerequisite for some smaller follow-up PRs around recipe files, system prompts, and removal of legacy train/messages formats as well as overall integration testing end-to-end.


mergify bot commented Jul 23, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@bbrowning bbrowning force-pushed the data-mixing branch 3 times, most recently from e49a1e0 to 9d3dd74 Compare July 23, 2024 19:22
@mergify mergify bot added ci-failure and removed needs-rebase labels Jul 23, 2024
@mergify mergify bot removed the ci-failure label Jul 23, 2024
@bbrowning bbrowning marked this pull request as ready for review July 23, 2024 20:32
@bbrowning (Contributor Author)

Ok, I feel like this is at a point where it's good enough to review. I'll be ready to address any concerns quickly to keep the round-trip time of that whole process as short as possible. After merging we'll need some follow-up PRs on top of this work (tracked in #162) to write out recipe yaml files, add auxiliary datasets, add precomputed datasets, and remove the legacy train/messages json output formats.

src/instructlab/sdg/generate_data.py (outdated, resolved)
return knowledge_ds


def _build_raft_dataset(ds: Dataset, p, num_doc_in_context=4):
Contributor

ChatGPT offers 2 possible explanations for what "raft" could mean here:

Random Augmented Feature Training (RAFT): In machine learning, "RAFT" might stand for Random Augmented Feature Training, a technique where random augmentation is applied to the features of a dataset to improve the model's robustness and performance. The function's description suggests that additional context from random samples is added, which could be a form of data augmentation.

Reference Augmented Fine-Tuning (RAFT): In NLP, RAFT might refer to adding additional context or references to enhance the training data. By incorporating context from other samples, the function could be aiming to provide a more comprehensive background, potentially improving the quality of responses generated by a language model.

Seems like it's the latter? Expanding the acronym and linking to https://aka.ms/raft-paper would help

Contributor Author

I do think this is in the spirit of that RAFT paper, but I'll give this a different name that doesn't require understanding the acronym - something like _add_extra_contexts_to_samples, since it's adding extra contexts to each sample in the dataset.
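
For illustration, here is a sketch of what such a helper might do in the spirit of the RAFT paper (hypothetical field names and sampling logic, not this PR's exact code):

import random

def _add_extra_contexts_to_samples(samples, p=0.5, num_doc_in_context=4):
    # With probability p, keep the sample's own "golden" context among the
    # extras; otherwise pad the sample only with distractor contexts drawn
    # from other samples
    all_contexts = [s["context"] for s in samples]
    augmented = []
    for s in samples:
        others = [c for c in all_contexts if c != s["context"]]
        if random.random() < p:
            extras = [s["context"]] + random.sample(
                others, k=min(num_doc_in_context - 1, len(others))
            )
        else:
            extras = random.sample(others, k=min(num_doc_in_context, len(others)))
        random.shuffle(extras)
        augmented.append({**s, "context": "\n".join(extras)})
    return augmented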

Contributor

See also

 metadata["dataset"] += f"_raft_p{p}"

Even expanding the acronym in the docstring to explain that naming would be useful 👍

src/instructlab/sdg/generate_data.py (outdated, resolved)
@markmc (Contributor) commented Jul 24, 2024

Looking into:

 File "/home/runner/work/sdg/sdg/venv/lib/python3.11/site-packages/instructlab/sdg/generate_data.py", line 404, in generate_data
    mixer.generate()
  File "/home/runner/work/sdg/sdg/venv/lib/python3.11/site-packages/instructlab/sdg/datamixing.py", line 468, in generate
    self._gen_mixed_data(
...
  File "/home/runner/work/sdg/sdg/venv/lib/python3.11/site-packages/instructlab/sdg/datamixing.py", line 38, in _load_ds
    dataset = load_dataset("json", data_files=path, split="train")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
  File "/home/runner/work/sdg/sdg/venv/lib/python3.11/site-packages/datasets/data_files.py", line 411, in resolve_pattern
    raise FileNotFoundError(error_msg)
FileNotFoundError: Unable to find '/home/runner/work/sdg/sdg/node_datasets_2024-07-24T14_23_22/compositional_skills_extraction_inference_qualitative_e2e-siblings.jsonl'

@markmc (Contributor) commented Jul 24, 2024

Looking into:

FileNotFoundError: Unable to find '/home/runner/work/sdg/sdg/node_datasets_2024-07-24T14_23_22/compositional_skills_extraction_inference_qualitative_e2e-siblings.jsonl'

I think it's this:

-        recipe.add_dataset(output_file_leaf_node, sampling_size)
+        recipe.add_dataset(output_file, sampling_size)

@bbrowning (Contributor Author)

Looking into:

FileNotFoundError: Unable to find '/home/runner/work/sdg/sdg/node_datasets_2024-07-24T14_23_22/compositional_skills_extraction_inference_qualitative_e2e-siblings.jsonl'

I think it's this:

-        recipe.add_dataset(output_file_leaf_node, sampling_size)
+        recipe.add_dataset(output_file, sampling_size)

I hit this error locally when testing this a minute ago, and making the change above fixes things on my local machine.

@markmc markmc changed the title Add support for mixing generated datasets Introduce a way to mix generated datasets before sending to training Jul 24, 2024
Rebased from
aakankshaduggal#4
aakankshaduggal#13
aakankshaduggal#14

Refactored and changed by me to be as minimally disruptive as possible
to merge in, which means leaving the legacy train and messages outputs
in place in generate_data.py.

This introduces a new Recipe class that references multiple generated
datasets and has logic to combine those into a single mixed dataset
based on a configurable number / ratio of samples to take from each
dataset.

This adds new output artifacts for each data generation run:
* a `node_datasets_*` subfolder that contains the raw generated samples
  from each taxonomy leaf node as a set of jsonl files - within here,
  each skill taxonomy node gets a single jsonl and knowledge nodes get
  two jsonl files used in separate downstream steps of training
* a `skills_train_msgs_*.jsonl` and a `knowledge_train_msgs_*.jsonl`
  file that contains mixtures of the above raw generated samples based
  on proportions specified during the mixing process

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
@markmc (Contributor) left a comment


I'm good with merging this as-is when CI passes. I think the cleanups, documentation, refactoring, scope reduction, etc. have gotten this to a place where it's much more maintainable going forward. Thanks, Ben!


dataset = dataset.map(_move_unallowed_cols_to_metadata, num_proc=num_proc)

# check if metadata column is string if not convert it using json.dumps
Contributor

Hard to see what situation this code is trying to handle? Seems like metadata can be a dict somehow?
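
If it helps, a guess at the situation being handled, as a hedged sketch (the helper name is hypothetical): samples whose metadata value arrives as a dict get serialized with json.dumps so the column holds a uniform string type.

import json

def _ensure_metadata_is_string(sample):
    # Serialize non-string metadata (e.g. a dict) into a JSON string
    if not isinstance(sample.get("metadata"), str):
        sample["metadata"] = json.dumps(sample.get("metadata", {}))
    return sample

# e.g. dataset = dataset.map(_ensure_metadata_is_string, num_proc=num_proc)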


@bbrowning (Contributor Author)

Ok, pushing the big green merge button on this for now, as it's a prerequisite for a number of other PRs. Thanks for all the great feedback; I'll keep cleaning this up with the work that builds on top of this.

@bbrowning bbrowning merged commit cbee53c into instructlab:main Jul 24, 2024
11 checks passed