
Fix mismatch in full pipeline outputs #75

Merged 1 commit on Jul 3, 2024

Conversation

russellb (Member) commented Jul 3, 2024

commit f37ecfc
Author: Russell Bryant [email protected]
Date: Wed Jul 3 11:27:38 2024 -0400

Fix mismatch in full pipeline outputs

The full knowledge pipeline had `question` and `response` as output
columns, while the skills pipelines used `question` and `answer`.

`generate_data.py` currently expects `response` instead of `answer`.
Instead of having to deal with both, just standardize on `response`,
since that seems to be used more frequently. For example, various
prompt filenames have "response" in their names.

Signed-off-by: Russell Bryant <[email protected]>
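
The standardization described in the commit message can be sketched roughly as follows. This is a hypothetical illustration only: the helper name `get_response` and the sample structure are assumptions, not the actual `generate_data.py` code.

```python
# Minimal sketch (hypothetical names): after this change, every pipeline
# emits the same output columns, so downstream code reads one canonical key.
OUTPUT_COLS = ["question", "response"]

def get_response(sample: dict) -> str:
    # No "answer" fallback is needed once all pipelines emit "response".
    return sample["response"]

row = {"question": "What is SDG?", "response": "Synthetic data generation."}
print(get_response(row))  # -> Synthetic data generation.
```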

aakankshaduggal (Member) left a comment:

Thanks for the quick fix @russellb 🎉

markmc (Contributor) commented Jul 3, 2024

> The _get_response() method handles the differences in the expected output from the "simple" and "full" pipelines. It incorrectly assumed that the "full" pipeline provided a "response" column, but it is actually "answer".

I'll admit to being thoroughly confused, but ...

Are you sure it's not "response" for the full knowledge pipeline?

Where exactly in the code do you even check this for sure? I'm looking at output_cols in the LLMBlock configs ...

russellb (Member, Author) commented Jul 3, 2024

>> The _get_response() method handles the differences in the expected output from the "simple" and "full" pipelines. It incorrectly assumed that the "full" pipeline provided a "response" column, but it is actually "answer".
>
> I'll admit to being thoroughly confused, but ...
>
> Are you sure it's not "response" for the full knowledge pipeline?

Yeah ...

> Where exactly in the code do you even check this for sure? I'm looking at output_cols in the LLMBlock configs ...

Yeah, it's in there ... but not at the end like you might expect.

"output_cols": ["question", "response"],

It's at the beginning. The rest of the pipeline is checking it to potentially be discarded, but the columns we care about in what makes it out the other end come from there.
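
To make the point concrete, here is a hedged sketch of why the generating block near the start of the pipeline determines the final columns. The field names and the `filter_rows` stand-in are assumptions for illustration, not the real config schema:

```python
# Hypothetical sketch: the generating LLMBlock declares output_cols early in
# the pipeline; later filter blocks only drop rows, they don't rename columns.
gen_block = {"name": "gen_knowledge", "output_cols": ["question", "response"]}

def filter_rows(rows, keep):
    # Stand-in for the later filter blocks: rows may be discarded,
    # but surviving rows keep the columns the generator produced.
    return [r for r in rows if keep(r)]

rows = [
    {"question": "q1", "response": "r1", "score": 1},
    {"question": "q2", "response": "r2", "score": 0},
]
survivors = filter_rows(rows, lambda r: r["score"] > 0)
print([set(r) >= {"question", "response"} for r in survivors])  # -> [True]
```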

markmc (Contributor) commented Jul 3, 2024

> Yeah, it's in there ... but not at the end like you might expect.
>
> `"output_cols": ["question", "response"],`
>
> It's at the beginning. The rest of the pipeline is checking it to potentially be discarded, but the columns we care about in what makes it out the other end come from there.

but ... that's response not answer?

russellb (Member, Author) commented Jul 3, 2024

> but ... that's response not answer?

Ughhhhh, you're right.

The original code was written against the pipeline I linked above (the knowledge pipeline).

The test I did with @aakankshaduggal was a skills addition, which uses "answer".

The real problem here is that the two differ in several ways, in both inputs and outputs, and I didn't notice until now. Thank you for being diligent here.

russellb force-pushed the full-output-format-fix branch from 6d23be5 to f37ecfc on July 3, 2024 17:30
russellb changed the title from "Fix expected column name from full pipeline" to "Fix mismatch in full pipeline outputs" on Jul 3, 2024
markmc (Contributor) commented Jul 3, 2024

Cool, lgtm 👍

russellb (Member, Author) commented Jul 3, 2024

I changed this PR completely.

Now it addresses the real problem, which is that the knowledge vs skills pipelines had different outputs. The original PR here fixed skills, but would have broken knowledge.

I'm still trying to get CI in place that can catch this stuff ...
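
A CI check of the kind mentioned could, as a rough sketch, assert that all pipelines declare the same final columns. The pipeline names and column sets below are placeholders, not parsed from the real configs:

```python
# Hypothetical consistency check: fail fast if the knowledge and skills
# pipelines ever declare different final output columns again.
EXPECTED = {"question", "response"}

# Stand-ins for the column sets parsed from each pipeline's block configs.
pipeline_output_cols = {
    "full_knowledge": {"question", "response"},
    "full_skills": {"question", "response"},
}

for name, cols in pipeline_output_cols.items():
    assert cols == EXPECTED, f"{name} emits {cols}, expected {EXPECTED}"
print("all pipelines emit", sorted(EXPECTED))
```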

russellb requested a review from aakankshaduggal on July 3, 2024 17:32
oindrillac (Contributor) left a comment:

lgtm

aakankshaduggal (Member) left a comment:

Tested and this works, thanks @russellb

aakankshaduggal merged commit 6251693 into instructlab:main on Jul 3, 2024
11 checks passed
russellb added this to the 0.1.0 milestone Jul 8, 2024
jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024
mergify: autoamtically apply backend label