
Fix mismatch in full pipeline outputs #75

Merged 1 commit on Jul 3, 2024

Conversation

russellb (Member) commented Jul 3, 2024

commit f37ecfc
Author: Russell Bryant [email protected]
Date: Wed Jul 3 11:27:38 2024 -0400

Fix mismatch in full pipeline outputs

The full knowledge pipeline had `question` and `response` as output
columns, while the skills pipelines used `question` and `answer`.

`generate_data.py` currently expects `response` instead of `answer`.
Instead of having to deal with both, just standardize on `response`,
since that seems to be used more frequently. For example, various
prompt filenames have "response" in their names.

Signed-off-by: Russell Bryant <[email protected]>
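
The standardization described in the commit message can be sketched roughly as follows. This is a hypothetical illustration only: the helper name `get_response` and the sample structure are assumptions, not the actual `generate_data.py` code.

```python
# Minimal sketch (hypothetical names): after this change, every pipeline
# emits the same output columns, so downstream code reads one canonical key.
OUTPUT_COLS = ["question", "response"]

def get_response(sample: dict) -> str:
    # No "answer" fallback is needed once all pipelines emit "response".
    return sample["response"]

row = {"question": "What is SDG?", "response": "Synthetic data generation."}
print(get_response(row))  # -> Synthetic data generation.
```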

aakankshaduggal (Member) left a comment:

Thanks for the quick fix @russellb 🎉

markmc (Contributor) commented Jul 3, 2024

> The _get_response() method handles the differences in the expected output from the "simple" and "full" pipelines. It incorrectly assumed that the "full" pipeline provided a "response" column, but it is actually "answer".

I'll admit to being thoroughly confused, but ...

Are you sure it's not "response" for the full knowledge pipeline?

Where exactly in the code do you even check this for sure? I'm looking at output_cols in the LLMBlock configs ...

russellb (Member, Author) commented Jul 3, 2024

>> The _get_response() method handles the differences in the expected output from the "simple" and "full" pipelines. It incorrectly assumed that the "full" pipeline provided a "response" column, but it is actually "answer".
>
> I'll admit to being thoroughly confused, but ...
>
> Are you sure it's not "response" for the full knowledge pipeline?

Yeah ...

> Where exactly in the code do you even check this for sure? I'm looking at output_cols in the LLMBlock configs ...

Yeah, it's in there ... but not at the end like you might expect.

"output_cols": ["question", "response"],

It's at the beginning. The rest of the pipeline is checking it to potentially be discarded, but the columns we care about in what makes it out the other end come from there.
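
To make the point concrete, here is a hedged sketch of why the generating block near the start of the pipeline determines the final columns. The field names and the `filter_rows` stand-in are assumptions for illustration, not the real config schema:

```python
# Hypothetical sketch: the generating LLMBlock declares output_cols early in
# the pipeline; later filter blocks only drop rows, they don't rename columns.
gen_block = {"name": "gen_knowledge", "output_cols": ["question", "response"]}

def filter_rows(rows, keep):
    # Stand-in for the later filter blocks: rows may be discarded,
    # but surviving rows keep the columns the generator produced.
    return [r for r in rows if keep(r)]

rows = [
    {"question": "q1", "response": "r1", "score": 1},
    {"question": "q2", "response": "r2", "score": 0},
]
survivors = filter_rows(rows, lambda r: r["score"] > 0)
print([set(r) >= {"question", "response"} for r in survivors])  # -> [True]
```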

markmc (Contributor) commented Jul 3, 2024

> Yeah, it's in there ... but not at the end like you might expect.
>
> `"output_cols": ["question", "response"],`
>
> It's at the beginning. The rest of the pipeline is checking it to potentially be discarded, but the columns we care about in what makes it out the other end come from there.

but ... that's response not answer?

russellb (Member, Author) commented Jul 3, 2024

> but ... that's response not answer?

Ughhhhh, you're right.

The original code was written against the pipeline I linked above (the knowledge pipeline).

The test I did with @aakankshaduggal was a skills addition, which uses "answer".

The real problem here is that the two differ in several ways, in both inputs and outputs, and I didn't notice until now. Thank you for being diligent here.

russellb force-pushed the full-output-format-fix branch from 6d23be5 to f37ecfc on July 3, 2024 17:30
russellb changed the title from "Fix expected column name from full pipeline" to "Fix mismatch in full pipeline outputs" on Jul 3, 2024
markmc (Contributor) commented Jul 3, 2024

Cool, lgtm 👍

russellb (Member, Author) commented Jul 3, 2024

I changed this PR completely.

Now it addresses the real problem, which is that the knowledge vs skills pipelines had different outputs. The original PR here fixed skills, but would have broken knowledge.

I'm still trying to get CI in place that can catch this stuff ...
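
A CI check of the kind mentioned could, as a rough sketch, assert that all pipelines declare the same final columns. The pipeline names and column sets below are placeholders, not parsed from the real configs:

```python
# Hypothetical consistency check: fail fast if the knowledge and skills
# pipelines ever declare different final output columns again.
EXPECTED = {"question", "response"}

# Stand-ins for the column sets parsed from each pipeline's block configs.
pipeline_output_cols = {
    "full_knowledge": {"question", "response"},
    "full_skills": {"question", "response"},
}

for name, cols in pipeline_output_cols.items():
    assert cols == EXPECTED, f"{name} emits {cols}, expected {EXPECTED}"
print("all pipelines emit", sorted(EXPECTED))
```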

russellb requested a review from aakankshaduggal on July 3, 2024 17:32
oindrillac (Contributor) left a comment:

lgtm

aakankshaduggal (Member) left a comment:

Tested and this works, thanks @russellb

aakankshaduggal merged commit 6251693 into instructlab:main on Jul 3, 2024
11 checks passed
russellb added this to the 0.1.0 milestone Jul 8, 2024
jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024
mergify: autoamtically apply backend label