generate_data.py calls leaf_node_to_samples(), which currently assumes that every pipeline type expects the same dataset format. That assumption held for the simple pipelines and the full knowledge pipeline, but I didn't notice that the full skills pipeline was different. #50 then renamed the fields expected in the template, so the default pipelines now expect 3 different dataset formats:
simple pipelines expect question_N, answer_N for N in (1, 2, 3)
full knowledge pipeline now expects icl_query_N / icl_response_N for N in (1, 2, 3)
full skills pipelines expect seed_question and seed_response (just 1 instead of 3)
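To make the mismatch concrete, here is a minimal sketch that records the three formats listed above and reports which fields a sample row lacks. The `EXPECTED_FIELDS` table and the `missing_fields()` helper are hypothetical names used for illustration, not part of the existing code, and the `icl_response_N` spelling is an assumption.

```python
# Hypothetical table of the fields each default pipeline expects,
# taken from the three formats listed above.
EXPECTED_FIELDS = {
    "simple": [
        f"{name}_{n}" for n in (1, 2, 3) for name in ("question", "answer")
    ],
    "full_knowledge": [
        f"{name}_{n}" for n in (1, 2, 3) for name in ("icl_query", "icl_response")
    ],
    "full_skills": ["seed_question", "seed_response"],
}


def missing_fields(pipeline_type, row):
    """Return the expected field names that are absent from a sample row."""
    return [field for field in EXPECTED_FIELDS[pipeline_type] if field not in row]


# A row in the simple format satisfies the simple pipeline but not the others.
simple_row = {
    "question_1": "q1", "answer_1": "a1",
    "question_2": "q2", "answer_2": "a2",
    "question_3": "q3", "answer_3": "a3",
}
print(missing_fields("simple", simple_row))
print(missing_fields("full_skills", simple_row))
```

A check like this could run before a pipeline starts, so a format mismatch surfaces as a clear error rather than a failure inside a template.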
Short term fix:
Document the dataset format expected by each pipeline.
Fix the code to generate the dataset in the expected format for each pipeline.
Longer term consideration:
None of this matches the source format for the dataset (taxonomy). Allowing people to specify a custom pipeline implies specifying their expected sample dataset format somehow.
Another idea instead ...
Always assume a consistent dataset format.
Add a new pipeline capability for dataset transformation -- rename fields if you want, squash the rows into groups of 3 seed questions/answers per row (for the knowledge case)
I think something like this is going to be necessary to allow more configurable custom pipelines, as we'll need a way for a custom pipeline to declare the dataset format it is expecting from a known starting point.
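As a rough sketch of what such a transformation step might look like, the helpers below operate on plain lists of dicts. Everything here is an assumption for illustration: the function names `rename_fields()` and `squash_rows()` are hypothetical, the starting format of one `seed_question` / `seed_response` pair per row is only a guess at what the consistent format could be, and incomplete trailing groups are simply dropped.

```python
from typing import Dict, List

Row = Dict[str, str]


def rename_fields(rows: List[Row], mapping: Dict[str, str]) -> List[Row]:
    """Rename keys in every row; keys not in the mapping are kept as-is."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


def squash_rows(rows: List[Row], group_size: int = 3) -> List[Row]:
    """Merge each run of group_size rows into one row with _1.._N suffixes.

    Rows left over after the last complete group are dropped.
    """
    squashed = []
    for start in range(0, len(rows) - group_size + 1, group_size):
        combined = {}
        for n, row in enumerate(rows[start:start + group_size], start=1):
            for key, value in row.items():
                combined[f"{key}_{n}"] = value
        squashed.append(combined)
    return squashed


# Assumed starting point: one seed question/response pair per row.
seed_rows = [
    {"seed_question": f"q{i}", "seed_response": f"r{i}"} for i in range(1, 5)
]

# Knowledge-style format: rename the fields, then group 3 pairs per row.
renamed = rename_fields(
    seed_rows, {"seed_question": "icl_query", "seed_response": "icl_response"}
)
knowledge_rows = squash_rows(renamed, group_size=3)
print(knowledge_rows)
```

With steps like these, a custom pipeline could declare its expected format as a short list of transformations applied to the known starting point, instead of requiring changes to leaf_node_to_samples().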
PR instructlab#50 changed the format used in the full knowledge pipeline. Change
the simple pipelines to match.
Part of issue instructlab#55.
Signed-off-by: Russell Bryant <[email protected]>