Include precomputed dataset and datamixing recipes #234

aakankshaduggal · 2024-08-05T17:35:17Z

Related to #201

We decided not to move forward with the Hugging Face approach, but to get better results with SDG + train, we need a precomputed dataset to mix the newly generated data with.

A couple of approaches we could take:

Support ilab data download, which would pull the data from InstructLab's Hugging Face organization.
Allow users to store their own precomputed dataset at a defined path.
If neither of these is defined, either no mixing happens or a default dataset is downloaded from Hugging Face.
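The fallback order described above could be sketched roughly as follows. This is only an illustration, not existing InstructLab code: the function name `resolve_precomputed_dataset` and both parameters are hypothetical, and it assumes the user-supplied path should win over a downloaded dataset.

```python
from pathlib import Path
from typing import Optional


def resolve_precomputed_dataset(
    user_path: Optional[Path],
    downloaded_path: Optional[Path],
) -> Optional[Path]:
    """Pick the dataset to mix with, or return None when no mixing should happen.

    Hypothetical helper: a user-defined dataset is preferred (an assumption),
    then a previously downloaded one, and otherwise nothing is mixed.
    """
    for candidate in (user_path, downloaded_path):
        # Skip options that were not configured or whose file is missing.
        if candidate is not None and candidate.is_file():
            return candidate
    return None
```

A caller would treat a `None` result as "skip mixing", or use it as the trigger for a default download if that option is chosen instead.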

bbrowning · 2024-08-05T19:01:41Z

The only way to specify a default dataset today is to supply a default recipe YAML file for knowledge and/or skills. These would reside at a path like /usr/share/instructlab/sdg/default_data_recipes/skills.yaml, ~/.local/share/instructlab/sdg/default_data_recipes/skills.yaml, etc. (the exact path is system-dependent and comes from platformdirs.PlatformDirs). So, a user could do this today by hand-writing a default recipe at the correct path. Or, something like ilab data download could download that dataset from Hugging Face, place it into an appropriate path, and then write out a default recipe that references it.

Once the default recipe file gets in the right place, the rest of the existing data generation code should automatically pick up and use that recipe for mixing.
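To make the lookup concrete, here is a minimal sketch of how a default recipe could be located. It is not the actual SDG implementation: `find_default_recipe` and `RECIPE_SUBDIR` are hypothetical names, and the candidate directories are passed in explicitly rather than being derived from the `platformdirs` package, so the search order is whatever the caller supplies.

```python
from pathlib import Path
from typing import Iterable, Optional

# Directory name taken from the example paths in this thread.
RECIPE_SUBDIR = "default_data_recipes"


def find_default_recipe(data_dirs: Iterable[Path], kind: str) -> Optional[Path]:
    """Return the first existing <dir>/default_data_recipes/<kind>.yaml, or None.

    Hypothetical helper: `kind` is expected to be "skills" or "knowledge".
    """
    for data_dir in data_dirs:
        recipe = Path(data_dir) / RECIPE_SUBDIR / f"{kind}.yaml"
        if recipe.is_file():
            return recipe
    return None
```

For example, calling it with `[Path.home() / ".local/share/instructlab/sdg", Path("/usr/share/instructlab/sdg")]` and `"skills"` would check the two paths mentioned above, user directory first. That ordering is an assumption of this sketch.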

bbrowning · 2024-08-05T19:19:23Z

Thinking more from a user's point-of-view, is downloading one or more precomputed datasets a different task from creating a recipe that uses those datasets? Would I want to ilab data download <some other HF dataset>, just like I can download different models? Where do those datasets get stored on disk when I do so? Once I've downloaded them, how do I generate a recipe to use them? How do I pass my custom dataset and/or recipe into ilab data generate?

And, all of this is only relevant for users with big hardware doing the full data generation pipeline and phased training, right? Does the precomputed dataset impact the output at all for any user doing legacy training, simple pipeline, or non-phased training?

markmc · 2024-08-09T13:55:52Z

xref #237

github-actions · 2024-11-19T02:02:48Z

This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.

bbrowning · 2024-11-20T13:24:45Z

This is still relevant, so I'm commenting to keep it from going stale.

nathan-weinberg added the enhancement (New feature or request) label Aug 20, 2024

github-actions bot added the stale label Nov 19, 2024

github-actions bot removed the stale label Dec 11, 2024
