Introduce a way to mix generated datasets before sending to training #163
Conversation
Force-pushed 1028021 to cf42ea3
The e2e test finishes synthetic data generation now, although it fails during training because of:
This pull request has merge conflicts that must be resolved before it can be merged.
A bunch of comments all over the place, not terribly useful I'm afraid. Will try to follow up with some higher-level comments
src/instructlab/sdg/generate_data.py (outdated)

skills_phase_data.to_json(skills_fpath, orient="records", lines=True)

knowledge_recipe.add_dataset(knowledge_fpath)
skills_recipe.add_dataset(skills_fpath)
Is the roundtrip to disk necessary? i.e. we're writing the dataset to a file just for it to be loaded again shortly after?
The benefit is that we have a record of this intermediate stage? Or reduced memory usage?
My understanding is that it's so we have an artifact trail of the raw generated data as well as the mixed data, and when #185 goes in, the recipe files will reference this raw generated data to show what was mixed into the overall mixed dataset. The mixing process can reduce the number of samples from a given taxonomy leaf in the overall mixed dataset, so these intermediate files also allow the entire generated output for each leaf node to be retained as an artifact.
Reduced memory usage is the other side effect of this, since we're only ever keeping the dataset for a single leaf node in memory until the mixing process, when we sample some number/percent of the overall generated samples for each leaf instead of loading the entirety of all leaf node generated samples at once.
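
To make that round trip concrete, here is an illustrative sketch (not the PR's exact code) of reloading one leaf node's persisted jsonl and keeping only a sampled portion of it at mix time. The function name, path argument, and the 0.3 ratio are made up for the example; it assumes the Hugging Face datasets API that generate_data.py already uses.

# Illustrative sketch only -- not the PR's implementation. Assumes the
# Hugging Face `datasets` API already used in generate_data.py; the function
# name and the default ratio are invented for this example.
from datasets import load_dataset

def reload_leaf_sample(leaf_fpath: str, ratio: float = 0.3):
    # The full per-leaf jsonl stays on disk as an artifact; only the sampled
    # portion is brought back into memory for mixing.
    leaf_ds = load_dataset("json", data_files=leaf_fpath, split="train")
    keep = int(len(leaf_ds) * ratio)
    return leaf_ds.shuffle(seed=42).select(range(keep))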
I think a dev-doc is going to be the quickest way for me to get comfortable with this - if we merged it in anything like its current form, I feel like a bunch of follow-up bug-fixing and refactoring will be needed ... and I know I wouldn't feel like I could do that safely without the kind of additional context I'd expect in a dev-doc.

I think there's probably also an opportunity to pull stuff out to get to a more KISS starting point (that could be merged more quickly) - e.g. if we configured the initial datasets in the pipeline configs, and omitted the writing of intermediate datasets and recipe files to disk, we wouldn't need a whole new recipe file format?
This pull request has merge conflicts that must be resolved before it can be merged.
I think I need to see the e2e run-through to understand what's going on here.
src/instructlab/sdg/generate_data.py (outdated)

if is_knowledge:
    knowledge_phase_data = _create_phase07_ds(logger, new_generated_data)
    output_file_leaf_knowledge = (
        f"node_datasets_{date_suffix}/node_{i}_p07.jsonl"
I'm missing context here, but I'd love to know the significance of p07 and p10 below.
I don't have full context on why they're named this, but I did push some additional changes that attempt to document what's in these p07 and p10 datasets at least.
E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run
This makes more sense if we want to use PipelineContext to separately initialize the mmlu-bench pipeline. Suggestion from @bbrowning in instructlab#163. Signed-off-by: Mark McLoughlin <[email protected]>
I think I've addressed most of the initial feedback, although I'm still waiting on a couple of answers from others to ensure I've documented things properly and didn't prune too much out in my attempts to streamline this. I've also parked a copy of my fork's branch at https://github.com/bbrowning/instructlab-sdg/commits/data-mixing-full-history/ up to this point for reference to myself or anyone else who works on follow-up PRs to add back changes that I'm pruning out of scope for this one.

Now, on to the squash/rebase phase to get the list of commits down to a reasonable number and ensure authorship is correctly attributed. Once that's done, I think this will be in decent shape for additional review and merging consideration, with the understanding that it's not perfect but has to get in under specific time constraints and is a prerequisite for some smaller follow-up PRs around recipe files, system prompts, and removal of legacy train/messages formats, as well as overall integration testing end-to-end.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed e49a1e0 to 9d3dd74
Ok, I feel like this is at a point where it's good enough to review. I'll be ready to address any concerns quickly to keep the round-trip time of that whole process as short as possible. After merging we'll need some follow-up PRs on top of this work (and tracked in #162) to write out recipe yaml files, add auxiliary datasets, add precomputed datasets, and remove the legacy train/messages json output formats.
src/instructlab/sdg/generate_data.py (outdated)

return knowledge_ds


def _build_raft_dataset(ds: Dataset, p, num_doc_in_context=4):
ChatGPT offers 2 possible explanations for what "raft" could mean here:
Random Augmented Feature Training (RAFT): In machine learning, "RAFT" might stand for Random Augmented Feature Training, a technique where random augmentation is applied to the features of a dataset to improve the model's robustness and performance. The function's description suggests that additional context from random samples is added, which could be a form of data augmentation.
Reference Augmented Fine-Tuning (RAFT): In NLP, RAFT might refer to adding additional context or references to enhance the training data. By incorporating context from other samples, the function could be aiming to provide a more comprehensive background, potentially improving the quality of responses generated by a language model.
Seems like it's the latter? Expanding the acronym and linking to https://aka.ms/raft-paper would help
I do think this is in the spirit of that RAFT paper, but I'll give this a different name that doesn't require understanding the acronym - something like _add_extra_contexts_to_samples, since it's adding extra contexts to each sample in the dataset.
See also
metadata["dataset"] += f"_raft_p{p}"
Even expanding the acronym in the docstring to explain that naming would be useful 👍
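
For readers without the paper handy, here is a rough sketch of the RAFT-style idea being discussed: padding each sample's own context with contexts borrowed from other random samples. The "context" column name, treating p as a keep-the-golden-context probability, and joining contexts with newlines are assumptions for illustration, not the PR's actual logic.

# Rough sketch of the RAFT-style augmentation discussed above, not the PR's code.
# Assumes a "context" column and treats p as the probability of keeping the
# sample's own (golden) context alongside distractor contexts from other rows.
import random
from datasets import Dataset

def add_extra_contexts_to_samples(ds: Dataset, p: float, num_doc_in_context: int = 4) -> Dataset:
    rng = random.Random(42)
    all_contexts = ds["context"]

    def _expand(sample, idx):
        # with probability p, keep this sample's own context in the mix
        contexts = [sample["context"]] if rng.random() < p else []
        # fill the remaining slots with contexts drawn from other samples
        other_idxs = [j for j in range(len(all_contexts)) if j != idx]
        picks = rng.sample(other_idxs, min(num_doc_in_context - len(contexts), len(other_idxs)))
        contexts.extend(all_contexts[j] for j in picks)
        rng.shuffle(contexts)
        sample["context"] = "\n".join(contexts)
        return sample

    return ds.map(_expand, with_indices=True)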
Looking into:
I think it's this:
I hit this error locally when testing this a minute ago, and making the change above fixes things on my local machine.
Rebased from aakankshaduggal#4 aakankshaduggal#13 aakankshaduggal#14

Refactored and changed by me to be as least disruptive as possible in merging in, which means leaving the legacy train and messages outputs in place in generated_data.py.

This introduces a new Recipe class that references multiple generated datasets and has logic to combine those into a single mixed dataset based on a configurable number / ratio of samples to take from each dataset.

This adds new output artifacts for each data generation run:

* a `node_datasets_*` subfolder that contains the raw generated samples from each taxonomy leaf node as a set of jsonl files - within here, each skill taxonomy node gets a single jsonl and knowledge nodes get two jsonl files used in separate downstream steps of training
* a `skills_train_msgs_*.jsonl` and a `knowledge_train_msgs_*.jsonl` file that contains mixtures of the above raw generated samples based on proportions specified during the mixing process

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
I'm good with merging this as-is when CI passes. I think the cleanups, documentation, refactoring, scope reduction, etc. have gotten this to a place where it's much more maintainable going forward. Thanks, Ben!
dataset = dataset.map(_move_unallowed_cols_to_metadata, num_proc=num_proc)

# check if metadata column is string if not convert it using json.dumps
Hard to see what situation this code is trying to handle? Seems like metadata can be a dict somehow?
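
One plausible reading of that comment, sketched under assumptions (a "metadata" column, and upstream steps possibly leaving it as a dict): normalize the column to JSON strings before mixing so every row has a consistent type.

# Sketch of the kind of normalization the comment seems to describe, not the
# actual code: if the metadata column holds dicts rather than strings,
# serialize each value with json.dumps so the column has one consistent type.
import json
from datasets import Dataset

def ensure_metadata_is_string(dataset: Dataset) -> Dataset:
    if len(dataset) == 0 or isinstance(dataset[0].get("metadata"), str):
        return dataset
    return dataset.map(lambda sample: {"metadata": json.dumps(sample["metadata"])})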
Ok, pushing the big green merge button on this for now as it's a prerequisite for a number of other PRs. Thanks for all the great feedback, and I'll keep cleaning this up with the work that builds on top of this.
See #162
Rebased from aakankshaduggal#4, aakankshaduggal#13, and aakankshaduggal#14, reducing scope and splitting out what we can into subsequent follow-up PRs for the overall data mixing functionality instead of landing everything as one big PR.
Refactored and changed by me to be as least disruptive as possible in merging in, which means leaving the legacy train and messages outputs in place in generated_data.py.
This introduces a new Recipe class that references multiple generated datasets and has logic to combine those into a single mixed dataset based on a configurable number / ratio of samples to take from each dataset.
This adds new output artifacts for each data generation run:

* a `node_datasets_*` subfolder that contains the raw generated samples from each taxonomy leaf node as a set of jsonl files - within here, each skill taxonomy node gets a single jsonl and knowledge nodes get two jsonl files used in separate downstream steps of training
* a `skills_train_msgs_*.jsonl` and a `knowledge_train_msgs_*.jsonl` file that contains mixtures of the above raw generated samples based on proportions specified during the mixing process
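
To picture the Recipe described above, here is a hypothetical sketch of its general shape - it records generated dataset paths with a sampling size or ratio for each, then combines them into one mixed jsonl. Only add_dataset appears in the diff snippets above; the other method names, the seed, and the float-vs-int handling are guesses, not the merged implementation.

# Hypothetical sketch of the Recipe idea, not the class as merged in this PR.
from datasets import load_dataset, concatenate_datasets

class Recipe:
    def __init__(self):
        # each entry is (path_to_generated_jsonl, sampling_size)
        self.datasets = []

    def add_dataset(self, path, sampling_size=1.0):
        # sampling_size: a 0-1 ratio of samples to keep, or an absolute count
        self.datasets.append((path, sampling_size))

    def _sample(self, ds, sampling_size):
        num = int(len(ds) * sampling_size) if isinstance(sampling_size, float) else min(sampling_size, len(ds))
        return ds.shuffle(seed=42).select(range(num))

    def save_mixed_dataset(self, output_path):
        # reload each per-leaf jsonl, keep its configured share, and write one
        # shuffled training mixture (e.g. a skills_train_msgs_*.jsonl file)
        parts = [
            self._sample(load_dataset("json", data_files=path, split="train"), size)
            for path, size in self.datasets
        ]
        concatenate_datasets(parts).shuffle(seed=42).to_json(output_path, orient="records", lines=True)

The knowledge_recipe.add_dataset(knowledge_fpath) and skills_recipe.add_dataset(skills_fpath) calls in the earlier diff snippet fit this general shape.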