Fix mismatch in full pipeline outputs #75
Thanks for the quick fix @russellb 🎉
I'll admit to being thoroughly confused, but ... Are you sure it's not "response" for the full knowledge pipeline? Where exactly in the code do you even check this for sure? I'm looking at |
Yeah, it's in there ... but not at the end like you might expect.
sdg/src/instructlab/sdg/default_flows.py, line 153 in bcb7974
It's at the beginning. The rest of the pipeline only checks it to decide whether it should be discarded, but the columns we care about in what makes it out the other end come from there.
but ... that's |
Ughhhhh, you're right. The original code was written against what I copied a link to (the knowledge pipeline). The test I did with @aakankshaduggal was a skills addition, which uses "answer". The real problem here is that the two differ in several ways, in both inputs and outputs, and I didn't notice until now. Thank you for being diligent here.
The full knowledge pipeline had `question` and `response` as output columns, while the skills pipelines used `question` and `answer`. `generate_data.py` currently expects `response` instead of `answer`.

Instead of having to deal with both, just standardize on `response`, since that seems to be used more frequently. For example, various prompt filenames have "response" in their names.

Signed-off-by: Russell Bryant <[email protected]>
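To make the mismatch concrete, here is a minimal sketch of what standardizing on `response` means for rows coming out of the two kinds of pipeline. It uses plain Python dicts in place of the real dataset type, and `standardize_output_column` is a hypothetical helper written for illustration, not a function in this repository; how the PR itself applies the change is not shown here.

```python
# Hypothetical helper for illustration only; not part of instructlab-sdg.
def standardize_output_column(rows, old="answer", new="response"):
    """Return copies of `rows` with the `old` key renamed to `new`."""
    normalized = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        if old in row:
            if new in row:
                # Refuse to silently overwrite an existing value.
                raise ValueError(f"row has both {old!r} and {new!r}")
            row[new] = row.pop(old)
        normalized.append(row)
    return normalized


# Skills-style rows used `answer`; knowledge-style rows used `response`.
skills_rows = [{"question": "What is 2 + 2?", "answer": "4"}]
knowledge_rows = [{"question": "Who wrote Hamlet?", "response": "William Shakespeare"}]

normalized_skills = standardize_output_column(skills_rows)
normalized_knowledge = standardize_output_column(knowledge_rows)
```

After this step both sets of rows expose `question` and `response`, so a consumer that expects `response` can treat them the same way.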
6d23be5 to f37ecfc
Cool, lgtm 👍
I changed this PR completely. Now it addresses the real problem, which is that the knowledge vs skills pipelines had different outputs. The original PR here fixed skills, but would have broken knowledge. I'm still trying to get CI in place that can catch this stuff ...
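As a rough idea of the kind of check CI could run for this, the sketch below verifies that every output row carries the columns the consumer expects. The names `EXPECTED_COLUMNS` and `missing_columns` are assumptions made for this example, and the rows are plain dicts standing in for real pipeline output.

```python
# Hypothetical check for illustration only; not an existing test in this repository.
EXPECTED_COLUMNS = frozenset({"question", "response"})


def missing_columns(rows, expected=EXPECTED_COLUMNS):
    """Return the set of expected column names absent from at least one row."""
    missing = set()
    for row in rows:
        missing |= set(expected) - set(row)
    return missing


# Rows shaped like the two outputs described in this PR.
knowledge_output = [{"question": "Who wrote Hamlet?", "response": "William Shakespeare"}]
old_skills_output = [{"question": "What is 2 + 2?", "answer": "4"}]

knowledge_missing = missing_columns(knowledge_output)
skills_missing = missing_columns(old_skills_output)
```

A test built on this would fail for the old skills output, because `response` is reported as missing, while the knowledge output passes.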
lgtm
Tested and this works, thanks @russellb
mergify: autoamtically apply backend label
f37ecfc Fix mismatch in full pipeline outputs
commit f37ecfc
Author: Russell Bryant [email protected]
Date: Wed Jul 3 11:27:38 2024 -0400