Include precomputed dataset and datamixing recipes #234
The only way to specify a default dataset today is to supply a default recipe YAML file for knowledge and/or skills. These would reside at a path like Once the default recipe file gets in the right place, the rest of the existing data generation code should automatically pick up and use that recipe for mixing.
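To make the lookup described above concrete, here is a minimal sketch of how code might check for a default recipe file before mixing. The real path is not given in this thread, so the directory layout (`<recipes_dir>/<kind>/default_recipe.yaml`), the file name, and the function name `find_default_recipe` are all assumptions made for illustration, not the project's actual API.

```python
from pathlib import Path
from typing import Optional

# Hypothetical file name; the actual default recipe path is not stated in this thread.
DEFAULT_RECIPE_NAME = "default_recipe.yaml"


def find_default_recipe(recipes_dir: Path, kind: str) -> Optional[Path]:
    """Return the default recipe file for `kind`, or None if it is absent.

    `kind` is either "knowledge" or "skills", matching the two recipe
    types mentioned in the comment above.
    """
    if kind not in ("knowledge", "skills"):
        raise ValueError(f"unknown recipe kind: {kind!r}")
    # Assumed layout: <recipes_dir>/<kind>/default_recipe.yaml
    candidate = recipes_dir / kind / DEFAULT_RECIPE_NAME
    return candidate if candidate.is_file() else None
```

With a lookup like this, a caller would mix in the precomputed data only when a path comes back, and otherwise proceed with the generated data alone.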
Thinking more from a user's point-of-view, is downloading one or more precomputed datasets a different task from creating a recipe that uses those datasets? Would I want to And, all of this is only relevant for users with big hardware doing the full data generation pipeline and phased training, right? Does the precomputed dataset impact the output at all for any user doing legacy training, simple pipeline, or non-phased training?
xref #237
This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.
This is still relevant, so commenting as such to keep from becoming stale.
Related to #201
We decided not to move forward with the Hugging Face approach, but to get better results with SDG + train, we need a precomputed dataset that the newly generated data will be mixed with.
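As a rough illustration of what mixing newly generated data with a precomputed dataset could look like, here is a small sketch. It is not the project's mixing code: the function name `mix_datasets`, the `precomputed_ratio` parameter, and the sampling strategy are assumptions chosen only to show the idea.

```python
import random
from typing import Dict, List

Sample = Dict[str, str]


def mix_datasets(
    generated: List[Sample],
    precomputed: List[Sample],
    precomputed_ratio: float = 1.0,
    seed: int = 0,
) -> List[Sample]:
    """Keep every generated sample and add a random share of precomputed ones.

    `precomputed_ratio` (hypothetical knob) is the number of precomputed
    samples to add per generated sample, capped at what is available.
    """
    if precomputed_ratio < 0:
        raise ValueError("precomputed_ratio must be non-negative")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    wanted = round(len(generated) * precomputed_ratio)
    count = min(len(precomputed), wanted)
    mixed = list(generated) + rng.sample(precomputed, count)
    rng.shuffle(mixed)
    return mixed
```

In this sketch the generated data is never downsampled; only the amount of precomputed data varies, which matches the intent of using it as a supplement to fresh SDG output.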
A couple of approaches that we could take:
ilab data download
- this will pull the data from InstructLab's Hugging Face.