-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support converting messages datasets into multiple pre-training formats #341
Conversation
0fa9879
to
3e20653
Compare
7a42c65
to
7e344bc
Compare
@Maxusmusti Can you confirm that we don't need any |
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Jaideep Rao <[email protected]>
7e344bc
to
b963c97
Compare
@bbrowning yeah you don't need to add the end_of_text token, that gets added by the chat template: https://github.com/instructlab/training/pull/319/files#diff-a8438361fc1435b584fec100fc73bd5bdc7856dc9826a570dd9dc7a6321f9bbcR30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jaideepr97! LGTM
This PR adds support for converting messages datasets into multiple pre-training formats to support working with both granite 7b and granite 3.0 student models. It accepts a
use_legacy_pretraining_format
parameter as input to appropriately choose the right format to useThis is intended to be a short term solution, with the long term idea being that SDG would be agnostic of student model requirements such as these