Initial CLI integration with new SDG interfaces #46
Conversation
I'm going to try to help this along.

Goal: have ilab data generate use the new SDG API (Flows, Pipelines, Blocks, etc.) and support "full SDG" (re "full SDG" - I'm a bit unclear how we define that precisely).

Constraint: avoid causing any regression in "simple SDG" (aka "small model support").

Step #1 - re-implement
Step #2 - replace the call to
Step #3 - add "full SDG" support in ilab data generate - e.g. by using the more advanced flows available via the new API

This PR is step #1. We can merge this PR once we're confident it is in good enough shape that it doesn't regress the existing ilab data generate (unless there's a particular regression we decide we're ok with).

I've been doing some hacking to get up to speed: https://github.com/russellb/sdg/compare/new-cli-integration...markmc:sdg:new-cli-integration?expand=1

Not being deeply familiar with the existing "simple SDG" flow though, I'm going to be slow in getting the new implementation up to parity. Happy to have help!
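For orientation, here's the rough shape this implies as I read it: a pipeline is a sequence of blocks, and each block transforms a dataset of samples. The class and method names below are only an illustration, not the actual instructlab.sdg interfaces.

```python
# Illustration only -- names and signatures are assumptions, not the real
# instructlab.sdg API.
from datasets import Dataset


class Block:
    """One generation/transformation step in a flow."""

    def generate(self, samples: Dataset) -> Dataset:
        raise NotImplementedError


class Pipeline:
    """Runs a sequence of blocks over a seed dataset."""

    def __init__(self, blocks: list[Block]):
        self.blocks = blocks

    def generate(self, dataset: Dataset) -> Dataset:
        for block in self.blocks:
            dataset = block.generate(dataset)
        return dataset
```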
Next steps AIUI:
e2e test failure is:
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 1605bf3 to a506c0a.
Status notes before I log off for tonight.

The Good News

Status
The e2e test job runs 3 rounds of data generation: grounded skill (a skill with context), freeform skill (a skill with no context), and knowledge. The two skills steps are exiting cleanly but generating zero data. The training step that follows is consuming the data we've generated for the knowledge addition and passing. This is a great milestone, but we have some more to go.

Next Top Priority

Other TODO items
There are still other TODO items scattered around.

Post-merge TODOs for the simple workflow
These are things I think can come after we merge this PR (IMO, open for discussion).

Full workflow via CLI
Most of the code we've done here applies to the full workflow; we just need to instantiate a different pipeline.
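As a very rough sketch of that point, the driver code stays the same and only which pipeline we instantiate differs; everything below is placeholder code, not the library's actual classes or flow names.

```python
# Placeholder sketch -- not the library's actual classes or flow names.
class EchoPipeline:
    """Stand-in for either the simple or the full pipeline."""

    def __init__(self, name: str):
        self.name = name

    def generate(self, seed_samples: list[dict]) -> list[dict]:
        # A real pipeline would run its blocks here; this just tags the samples.
        return [{"pipeline": self.name, **sample} for sample in seed_samples]


def run_generation(seed_samples: list[dict], full_workflow: bool) -> list[dict]:
    # The driver code is shared; only the pipeline we instantiate differs.
    pipeline = EchoPipeline("full" if full_workflow else "simple")
    return pipeline.generate(seed_samples)
```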
To stop utils.py becoming a dumping ground, split the current code into json, chunking, and openai sub-modules. Signed-off-by: Mark McLoughlin <[email protected]>
Part of instructlab#11. sdg appears to be the main user of this, along with `ilab taxonomy diff`. We want to adapt the output of read_taxonomy() to be better suited to what sdg needs. This is the majority of src/instructlab/utils.py from commit 4737feb, with read_taxonomy() and TaxonomyReadingException as the public API. Temporarily disable logging-fstring-interpolation to get lint passing. Signed-off-by: Mark McLoughlin <[email protected]>
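For reference, the public surface described here could be exercised roughly like this; the module path and call signature are assumptions based on the commit message rather than verified against the PR.

```python
# Sketch only -- module path and call signature are assumptions, not verified.
from instructlab.sdg.utils.taxonomy import TaxonomyReadingException, read_taxonomy

try:
    # Diff the taxonomy checkout against a base ref and collect the new/changed leaf nodes.
    leaf_nodes = read_taxonomy("taxonomy", "origin/main")
except TaxonomyReadingException as exc:
    print(f"failed to read taxonomy: {exc}")
```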
Signed-off-by: Russell Bryant <[email protected]>
The model_family param is used to "force" a particular family, overriding what might be guessed from the model filename. Since utils.models is copied and pasted from instructlab, let's isolate the use of utils.models to the generate_data() function, so if we move the generate_data() code to instructlab we can get rid of the copy here. In its place, add MODEL_FAMILY_MIXTRAL/MERLINITE constants to the API. Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Mark McLoughlin <[email protected]>
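As a rough sketch of how a caller might use these constants; the helper name and guessing logic here are assumptions for illustration, not the exact code in this PR.

```python
# Illustrative sketch -- helper name and logic are assumptions, not the PR's exact code.
MODEL_FAMILY_MIXTRAL = "mixtral"
MODEL_FAMILY_MERLINITE = "merlinite"


def resolve_model_family(forced_family: str | None, model_path: str) -> str:
    """Honor an explicitly forced family, otherwise guess from the filename."""
    if forced_family:
        return forced_family
    name = model_path.lower()
    return MODEL_FAMILY_MIXTRAL if "mixtral" in name else MODEL_FAMILY_MERLINITE
```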
Add a `batched` constructor param. llama-cpp is still the CLI default and doesn't support batching. We don't know in this code what backend is being used, so for now just turn off batching. We need to come back around to disabling it only when we know the default llama-cpp backend is in use. Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Mark McLoughlin <[email protected]>
_generate() returns a list of samples. The previous code would create a list of lists when batching was turned off. This restores that case back to a list of samples. Signed-off-by: Russell Bryant <[email protected]>
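A minimal sketch of the behavior these two commits describe (names are illustrative, not the actual LLMBlock internals): when batching is disabled, issue one request per sample and extend rather than append, so the result stays a flat list of samples.

```python
# Minimal sketch -- not the actual LLMBlock code; names are illustrative.
def generate_outputs(samples: list[dict], generate_batch, batched: bool = True) -> list[dict]:
    if batched:
        # Backend supports batching: one call covering all samples.
        return generate_batch(samples)
    # llama-cpp (the CLI default) does not support batching, so fall back to
    # one request per sample. Using extend() keeps this a flat list of
    # samples instead of a list of lists.
    outputs: list[dict] = []
    for sample in samples:
        outputs.extend(generate_batch([sample]))
    return outputs
```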
Force-pushed from 4515b4f to 197b611.
The CLI's default model is quantized merlinite, and it does not seem good enough to follow the instructions in the full pipeline included in the new library. It's not doing any validation on the output, so the output is not going to be great. Then again, the output has never been great doing SDG with merlinite and the old sdg implementation. This at least keeps the ability to run a basic workflow test and demo on a smaller system. Signed-off-by: Russell Bryant <[email protected]>
Force-pushed from e931ed8 to 7e932d0.
Saving off the "how to test" instructions here before I redo the PR description, now that it's almost ready for merge.

How to Test (assumes you have the

# Setup a venv to work in
mkdir workspace
cd workspace
python -m venv venv
. venv/bin/activate
# Install the `ilab` CLI from main
git clone https://github.com/instructlab/instructlab
cd instructlab
pip install -e .
cd ..
# Install this PR version of instructlab-sdg
git clone https://github.com/instructlab/sdg
cd sdg
gh pr checkout 46
pip install -e .
cd ..
# Configure ilab and put a test taxonomy addition in place
# allow it to check out taxonomy for you
ilab config init
cd taxonomy
git remote add russellb https://github.com/russellb/taxonomy.git
git fetch russellb
git checkout russellb/softball
cd ..
# Download merlinite
ilab download
# Optional - serve the model in another terminal. Helpful if you get server-side errors and want all the logs
cd workspace
. venv/bin/activate
ilab model serve
# and now you can test sdg
ilab data generate
# Find the results in the generated/ directory
ls generated/
cat generated/generated_*
This makes use of the new SDG API under the generate_data() method used by the CLI. It uses new simple workflows for knowledge and skills that are intended for basic use with a small model for testing and demo purposes. The full pipelines provided in the library will only work in larger environments capable of running Mixtral-8x7b. There are still various TODOs in the code, but this is enough to start with. I'm sure we will make enhancements to these basic workflows that still work for the small environments. Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Oindrilla Chatterjee <[email protected]>
Force-pushed from 7e932d0 to 54b065a.
Most of the CI jobs aren't running right now. GitHub is having problems: https://www.githubstatus.com/
FWIW, +1 from me if we're not aware of it causing significant regressions
Thanks! As long as
Add the last little bit needed to choose the right "full" pipeline for skills. I also renamed "profile" to "pipeline" to better reflect what is being selected here. The term "profile" is a bit overloaded from lots of past CLI UX discussion, so it's better not to use that here. Signed-off-by: Russell Bryant <[email protected]>
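For illustration, the selection this commit describes amounts to mapping the user-facing pipeline name to a flow per generation step; the flow names below are placeholders rather than the library's actual identifiers.

```python
# Illustrative sketch -- flow names are placeholders, not the library's identifiers.
PIPELINES = {
    "simple": {
        "knowledge": "simple_knowledge_flow",
        "freeform_skill": "simple_freeform_skill_flow",
        "grounded_skill": "simple_grounded_skill_flow",
    },
    "full": {
        "knowledge": "synth_knowledge_flow",
        "freeform_skill": "synth_freeform_skill_flow",
        "grounded_skill": "synth_grounded_skill_flow",
    },
}


def flows_for(pipeline: str) -> dict[str, str]:
    """Map the user-facing pipeline value to the flow used for each step."""
    try:
        return PIPELINES[pipeline]
    except KeyError:
        raise ValueError(f"unknown pipeline: {pipeline!r}") from None
```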
Tested, +1 to get this merged and do follow-ups on this.
I'm going to merge this, but won't cut a release just yet. I want to look at knowledge document chunking in particular. I don't think I fully restored the previous behavior on that.
Proposed Taxonomy tree
General Approach

The approach taken here is to try to make the new SDG API the only interface used, as opposed to keeping the old implementation off to the side. Reasons for this:

To prepare for this, we pulled all SDG code out of the CLI to start this library. The interface point between the CLI and instructlab.sdg is the generate_data() method in instructlab.sdg.generate_data. That is where you'll see most of the changes centered in this PR. The PR changes the implementation underneath the existing generate_data() so we can make things work without further CLI changes.

Status Summary

This has been tested with Merlinite both locally and via the e2e CI job. The intent here is to replace the existing functionality. Future changes will work on adding more extensive pipeline support for larger systems.
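To make that interface point concrete, the CLI ends up going through a call shaped roughly like the one below; the arguments shown are illustrative guesses rather than the verified generate_data() signature.

```python
# Illustrative call only -- the arguments below are assumptions,
# not the verified generate_data() signature.
import logging

from instructlab.sdg.generate_data import generate_data

generate_data(
    logging.getLogger(__name__),              # the CLI passes its logger through
    api_base="http://localhost:8000/v1",      # endpoint from `ilab model serve`
    model_name="models/merlinite-7b-lab-Q4_K_M.gguf",
    taxonomy="taxonomy",
    taxonomy_base="origin/main",
    output_dir="generated",
)
```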
Changes
1e928ec Split up the utils module
39c943f Import utils.read_taxonomy() from instructlab
d886b12 Add get_model_family() from instructlab.utils
67dd057 Add model_prompt for merlinite/granite
d53a57a Add API to enable disabling batching
14d646f llmblock: fix batched=False
0207156 Add simple knowledge pipeline for use with default merlinite
9ee7f70 Use new API in generate_data
54b065a added number of iterations to generate
31ecfda Add skills variants for the full pipeline
commit 1e928ec
Author: Mark McLoughlin [email protected]
Date: Thu Jun 27 07:27:11 2024 -0400
commit 39c943f
Author: Mark McLoughlin [email protected]
Date: Thu Jun 27 06:27:21 2024 -0400
commit d886b12
Author: Russell Bryant [email protected]
Date: Tue Jun 25 17:57:24 2024 -0400
commit 67dd057
Author: Russell Bryant [email protected]
Date: Tue Jun 25 20:58:29 2024 -0400
commit d53a57a
Author: Mark McLoughlin [email protected]
Date: Thu Jun 27 05:55:32 2024 -0400
commit 14d646f
Author: Russell Bryant [email protected]
Date: Wed Jun 26 09:23:40 2024 -0400
commit 0207156
Author: Russell Bryant [email protected]
Date: Wed Jun 26 11:17:09 2024 -0400
commit 9ee7f70
Author: Russell Bryant [email protected]
Date: Tue Jun 25 16:31:19 2024 -0400
commit 54b065a
Author: Oindrilla Chatterjee [email protected]
Date: Fri Jun 28 11:52:12 2024 -0400
commit 31ecfda
Author: Russell Bryant [email protected]
Date: Fri Jun 28 16:30:24 2024 -0400