feat: support converting messages datasets into multiple pre-training formats #341

jaideepr97 · 2024-11-06T22:11:22Z

This PR adds support for converting messages datasets into multiple pre-training formats to support working with both granite 7b and granite 3.0 student models. It accepts a use_legacy_pretraining_format parameter as input to appropriately choose the right format to use

This is intended to be a short term solution, with the long term idea being that SDG would be agnostic of student model requirements such as these

src/instructlab/sdg/datamixing.py

bbrowning · 2024-11-07T18:09:53Z

@Maxusmusti Can you confirm that we don't need any <|end_of_text|> tokens in the pre-training samples here, like the chat template at https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188 uses after the message from each role? If we need to follow that format exactly, we'd also need these <|end_of_text|> tokens after the text from 1 role before starting the new role tokens?

mergify · 2024-11-07T19:08:21Z

This pull request has merge conflicts that must be resolved before it can be
merged. @jaideepr97 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Jaideep Rao <[email protected]>

Maxusmusti · 2024-11-07T21:37:00Z

@bbrowning yeah you don't need to add the end_of_text token, that gets added by the chat template: https://github.com/instructlab/training/pull/319/files#diff-a8438361fc1435b584fec100fc73bd5bdc7856dc9826a570dd9dc7a6321f9bbcR30

aakankshaduggal

Thanks @jaideepr97! LGTM

jaideepr97 mentioned this pull request Nov 6, 2024

feat: parametrize system prompt #339

Merged

mergify bot added the ci-failure label Nov 6, 2024

jaideepr97 requested review from bbrowning and aakankshaduggal November 6, 2024 22:12

jaideepr97 force-pushed the update-pretrain-conv branch from 0fa9879 to 3e20653 Compare November 6, 2024 22:18

mergify bot removed the ci-failure label Nov 6, 2024

jaideepr97 requested a review from khaledsulayman November 6, 2024 22:20

jaideepr97 changed the title ~~feat: support generating pre-training samples in multiple formats~~ feat: support converting messages datasets into multiple pre-training formats Nov 6, 2024

Maxusmusti reviewed Nov 6, 2024

View reviewed changes

src/instructlab/sdg/datamixing.py Outdated Show resolved Hide resolved

jaideepr97 force-pushed the update-pretrain-conv branch 2 times, most recently from 7a42c65 to 7e344bc Compare November 7, 2024 17:28

mergify bot added the needs-rebase label Nov 7, 2024

feat: support generating pre-training samples in multiple formats

b963c97

Signed-off-by: Jaideep Rao <[email protected]>

jaideepr97 force-pushed the update-pretrain-conv branch from 7e344bc to b963c97 Compare November 7, 2024 19:48

mergify bot removed the needs-rebase label Nov 7, 2024

aakankshaduggal approved these changes Nov 7, 2024

View reviewed changes

mergify bot added the one-approval label Nov 7, 2024

bbrowning approved these changes Nov 7, 2024

View reviewed changes

mergify bot removed the one-approval label Nov 7, 2024

khaledsulayman approved these changes Nov 7, 2024

View reviewed changes

mergify bot merged commit 7af918a into instructlab:main Nov 7, 2024
22 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: support converting messages datasets into multiple pre-training formats #341

feat: support converting messages datasets into multiple pre-training formats #341

jaideepr97 commented Nov 6, 2024 •

edited

Loading

bbrowning commented Nov 7, 2024

mergify bot commented Nov 7, 2024

Maxusmusti commented Nov 7, 2024

aakankshaduggal left a comment

feat: support converting messages datasets into multiple pre-training formats #341

feat: support converting messages datasets into multiple pre-training formats #341

Conversation

jaideepr97 commented Nov 6, 2024 • edited Loading

bbrowning commented Nov 7, 2024

mergify bot commented Nov 7, 2024

Maxusmusti commented Nov 7, 2024

aakankshaduggal left a comment

Choose a reason for hiding this comment

jaideepr97 commented Nov 6, 2024 •

edited

Loading