
Initial CLI integration with new SDG interfaces #46

Merged
merged 10 commits into instructlab:main from new-cli-integration
Jun 28, 2024

Conversation

russellb
Member

@russellb russellb commented Jun 26, 2024

General Approach

The approach taken here is to try to make the new SDG API the only interface
used as opposed to keeping the old implementation off to the side. Reasons for
this:

  • less to maintain
  • we'll be able to clean up CLI options to only include ones relevant to the new library

To prepare for this, we pulled all SDG code out of the CLI to start this
library. The interface point between the CLI and instructlab.sdg is the
generate_data() method in instructlab.sdg.generate_data. That is where
you'll see most of the changes centered in this PR.

The PR changes the implementation underneath the existing
generate_data() so we can make things work without further CLI changes.
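The shape of that boundary can be sketched with stand-ins (illustrative only -- the real generate_data() in instructlab.sdg.generate_data takes many more parameters, and _run_pipeline() is a hypothetical helper, not part of the library):

```python
def _run_pipeline(taxonomy_path):
    # Hypothetical stand-in for the new SDG machinery
    # (Blocks, Flows, Pipelines).
    return [f"sample derived from {taxonomy_path}"]

def generate_data(taxonomy_path, output_dir):
    # The CLI only calls this function; the pipeline machinery
    # behind it can be swapped out without any further CLI changes.
    samples = _run_pipeline(taxonomy_path)
    return {"output_dir": output_dir, "num_samples": len(samples)}
```

The point is that the CLI stays untouched while everything behind this one call is replaced.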

Status Summary

This has been tested with Merlinite both locally and via the e2e CI job. The
intent here is to replace the existing functionality. Future changes will work
on adding more extensive pipeline support for larger systems.

Changes

1e928ec Split up the utils module
39c943f Import utils.read_taxonomy() from instructlab
d886b12 Add get_model_family() from instructlab.utils
67dd057 Add model_prompt for merlinite/granite
d53a57a Add API to enable disabling batching
14d646f llmblock: fix batched=False
0207156 Add simple knowledge pipeline for use with default merlinite
9ee7f70 Use new API in generate_data
54b065a added number of iterations to generate
31ecfda Add skills variants for the full pipeline

commit 1e928ec
Author: Mark McLoughlin [email protected]
Date: Thu Jun 27 07:27:11 2024 -0400

Split up the utils module

To stop utils.py becoming a dumping ground, split the current code
into json, chunking, and openai sub-modules.

Signed-off-by: Mark McLoughlin <[email protected]>

commit 39c943f
Author: Mark McLoughlin [email protected]
Date: Thu Jun 27 06:27:21 2024 -0400

Import utils.read_taxonomy() from instructlab

Part of #11

sdg appears to be the main user of this, along with `ilab taxonomy diff`.

We want to adapt the output of read_taxonomy() to be better suited to
what sdg needs.

This is the majority of src/instructlab/utils.py from commit 4737feb
with read_taxonomy() and TaxonomyReadingException as the public API.

Temporarily disable logging-fstring-interpolation to get lint passing.

Signed-off-by: Mark McLoughlin <[email protected]>

commit d886b12
Author: Russell Bryant [email protected]
Date: Tue Jun 25 17:57:24 2024 -0400

Add get_model_family() from instructlab.utils

Signed-off-by: Russell Bryant <[email protected]>

commit 67dd057
Author: Russell Bryant [email protected]
Date: Tue Jun 25 20:58:29 2024 -0400

Add model_prompt for merlinite/granite

The model_family param is used to "force" a particular family,
overriding what might be guessed from the model filename.

Since utils.models is copied and pasted from instructlab,
let's isolate the use of utils.models to the generate_data()
function so if we move the generate_data() code to instructlab
we can get rid of the copy here.

In its place add MODEL_FAMILY_MIXTRAL/MERLINITE constants to
the API.

Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>

commit d53a57a
Author: Mark McLoughlin [email protected]
Date: Thu Jun 27 05:55:32 2024 -0400

Add API to enable disabling batching

Add a `batched` constructor param.

llama-cpp is still the CLI default and doesn't support batching. We
don't know in this code what backend is being used, so for now just
turn off batching. We need to come back around to disabling it only
when we know the default llama-cpp backend is in use.

Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>

commit 14d646f
Author: Russell Bryant [email protected]
Date: Wed Jun 26 09:23:40 2024 -0400

llmblock: fix batched=False

_generate() returns a list of samples. The previous code would create
a list of lists when batching was turned off. This restores that case
back to a flat list of samples.

Signed-off-by: Russell Bryant <[email protected]>
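The shape bug that commit fixes can be shown with a minimal stand-alone sketch (the helper names here are illustrative, not the actual llmblock internals):

```python
def _generate(batch):
    # Stand-in for the real _generate(): returns a list of samples.
    return [f"sample:{prompt}" for prompt in batch]

def generate(prompts, batched):
    if batched:
        # One call over the whole batch already yields a flat list.
        return _generate(prompts)
    # With batching off, calling _generate() per prompt produces a
    # list of lists; flatten it back to a flat list of samples.
    nested = [_generate([prompt]) for prompt in prompts]
    return [sample for batch in nested for sample in batch]
```

Without the flattening step, downstream code sees one-element lists instead of samples whenever batched=False.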

commit 0207156
Author: Russell Bryant [email protected]
Date: Wed Jun 26 11:17:09 2024 -0400

Add simple knowledge pipeline for use with default merlinite

The CLI's default model is quantized merlinite, and it does not seem
good enough to follow the instructions in the full pipeline included
in the new library.

It's not doing any validation on the output, so the output is not
going to be great.  Then again, the output has never been great doing
SDG with merlinite and the old sdg implementation. This at least keeps
the ability to run a basic workflow test and demo on a smaller system.

Signed-off-by: Russell Bryant <[email protected]>

commit 9ee7f70
Author: Russell Bryant [email protected]
Date: Tue Jun 25 16:31:19 2024 -0400

Use new API in generate_data

This makes use of the new SDG API under the generate_data()
method used by the CLI. It uses new simple workflows for knowledge and
skills that are intended for basic use with a small model for testing
and demo purposes. The full pipelines provided in the library will
only work in larger environments capable of running Mixtral-8x7b.

There are still various TODOs in the code, but this is enough to
start with. I'm sure we will make enhancements to these basic
workflows that still work for the small environments.

Signed-off-by: Russell Bryant <[email protected]>

commit 54b065a
Author: Oindrilla Chatterjee [email protected]
Date: Fri Jun 28 11:52:12 2024 -0400

added number of iterations to generate

Signed-off-by: Oindrilla Chatterjee <[email protected]>

commit 31ecfda
Author: Russell Bryant [email protected]
Date: Fri Jun 28 16:30:24 2024 -0400

Add skills variants for the full pipeline

Add the last little bit needed to choose the right "full" pipeline for
skills.

I also renamed "profile" to "pipeline" to better reflect what is being
selected here. The term "profile" is a bit overloaded from lots of
past CLI UX discussion, so it's better not to use that here.

Signed-off-by: Russell Bryant <[email protected]>

@mergify mergify bot added the ci-failure label Jun 26, 2024
@markmc
Contributor

markmc commented Jun 27, 2024

I'm going to try to help this along

Goal: to have ilab data generate use the new SDG API (Flows, Pipelines, Blocks, etc.) and support "full SDG"

(Re - "full SDG" - I'm a bit unclear how we define that precisely)

Constraint: avoid causing any regression in "simple SDG" (aka "small model support")

Step #1 - re-implement instructlab.sdg.generate_data.generate_data() using the new API - i.e. model the existing "simple SDG" using the new API, and then re-write generate_data() to use it

Step #2 - replace the call to instructlab.sdg.generate_data.generate_data() in instructlab/data/generate.py with calls to the new API - i.e. move the relatively simple code in generate_data() back into instructlab

Step #3 - add "full SDG" support in ilab data generate - e.g. by using the more advanced flows available via the new API


This PR is step #1

We can merge this PR once we're confident it is in good enough shape that it doesn't regress the existing ilab data generate (unless there's a particular regression we decide we're ok with)


I've been doing some hacking to get up to speed: https://github.com/russellb/sdg/compare/new-cli-integration...markmc:sdg:new-cli-integration?expand=1

Not being deeply familiar with the existing "simple SDG" flow though, I'm going to be slow in getting the new implementation up to parity. Happy to have help!

@markmc
Contributor

markmc commented Jun 27, 2024

Next steps, as I understand it:

  1. verify generate_data() is producing output in the same format as before
  2. create and incorporate workflows for freeform and grounded skills
  3. audit the CLI options to help identify any functional regressions
  4. get e2e passing

@markmc
Contributor

markmc commented Jun 27, 2024

e2e test failure is:

Generate synthetic data - Wed Jun 26 18:32:00 UTC 2024
...
Generating synthetic data using 'models/merlinite-7b-lab-Q4_K_M.gguf' model, taxonomy:'taxonomy' against http://127.0.0.1:8000/v1 server
INFO 2024-06-26 18:32:13,652 generate_data.py:222: generate_data Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-06-26 18:32:13,670 pipeline.py:48: generate Running block: gen_knowledge
INFO 2024-06-26 18:32:13,670 pipeline.py:49: generate Dataset({
    features: ['task_description', 'document', 'domain', 'question_1', 'response_1', 'question_2', 'response_2', 'question_3', 'response_3'],
    num_rows: 6
})
INFO 2024-06-26 18:32:56,378 generate_data.py:276: generate_data Generated 5 samples
INFO 2024-06-26 18:32:56,379 generate_data.py:282: generate_data Generation took 49.68s
...
Train the model - Wed Jun 26 18:32:57 UTC 2024
...
generated does not contain training or test files, did you run `ilab generate`?
Error: Process completed with exit code 1.

@mergify mergify bot added ci-failure CI/CD Affects CI/CD configuration testing Relates to testing and removed ci-failure labels Jun 27, 2024
Contributor

mergify bot commented Jun 27, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @russellb please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@russellb
Member Author

Status notes before I log off for tonight.

The Good News

  • e2e is green!

Status

The e2e test job runs 3 rounds of data generation: grounded skill (a skill with context), freeform skill (a skill with no context), and knowledge. The two skills steps exit cleanly but generate zero data. The training step that follows consumes the data we've generated for the knowledge addition and passes.

This is a great milestone, but we have more to do.

Next Top Priority

  • top priority: this PR includes a "simple" knowledge pipeline, but we need to add similar ones that handle the two types of skills. Now that the knowledge one works, I'm hoping the other two come together a lot quicker, as they will follow a very similar pattern.
    • We need a file similar to src/instructlab/sdg/configs/knowledge/simple_generate_qa.yaml but tweaked a bit: there's no document. The grounded skill one will be similar in that we'll be giving a context blob, much like the document given for knowledge. The freeform skill one will have neither context nor document.
    • In default_flows.py, in addition to the existing SimpleKnowledgeFlow, we need one or two simple one-stage skills flows.
    • generate_data.py needs updates to make use of these new flows
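The differences between the three simple flows can be summarized in a sketch (the class name, fields, and config paths below are hypothetical, not the library's actual API):

```python
from dataclasses import dataclass

@dataclass
class SimpleFlowSpec:
    # Hypothetical description of a one-stage flow.
    config_yaml: str
    needs_document: bool = False  # knowledge: a source document
    needs_context: bool = False   # grounded skill: a context blob

# Knowledge reads a source document; grounded skill takes a context
# blob instead; freeform skill takes neither.
KNOWLEDGE = SimpleFlowSpec(
    "configs/knowledge/simple_generate_qa.yaml", needs_document=True)
GROUNDED_SKILL = SimpleFlowSpec(
    "configs/skills/simple_grounded.yaml", needs_context=True)
FREEFORM_SKILL = SimpleFlowSpec(
    "configs/skills/simple_freeform.yaml")
```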

Other TODO items

There are still other todo items scattered around generate_data.py

  • we still don't handle the num-instructions parameter. @oindrillac was looking at that today.
  • there are other parameters still being ignored. I tried to notate what I thought should happen to them, but I did it quickly. There may be other parameters we need to be acting on.

Post-merge TODOs for the simple workflow

These are things I think can come after we merge this PR (IMO, open for discussion)

  • chunking is overly simplified right now: it just truncates the document rather than iterating over multiple chunks. We also have an open PR working on a smarter chunking mechanism (Updated Chunking Methods #45) -- doing something to process more chunks could possibly be a follow-up issue instead of blocking this PR.
  • There may be some CLI options that can be removed - TBD
  • we automatically turn batching off. we should turn it on when vllm is used, or we're talking to an arbitrary OpenAI compatible API endpoint. In other words, we should default to on unless the CLI tells us to turn it off because llama-cpp is in use. This is an optimization and not a functional blocker so I think a follow-up issue is OK.
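The intended default described in the last bullet could be expressed as a small helper (hypothetical; the actual wiring would live in the CLI, which knows which backend it started):

```python
def batching_enabled(backend: str, force_off: bool = False) -> bool:
    """Default batching to on; turn it off only when the CLI tells us
    the llama-cpp backend (which doesn't support batching) is in use."""
    if force_off:
        return False
    # vllm and arbitrary OpenAI-compatible endpoints handle batching.
    return backend != "llama-cpp"
```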

Full workflow via CLI

Most of the code we've done here applies to the full workflow; we just need to instantiate a different pipeline.

  • we need a CLI option for selecting the "full" pipeline vs the simple one. This may justify a quick design doc in instructlab/dev-docs just to clarify the UX. --pipeline simple and --pipeline full might be enough, but there was a LOT of debate over a very similar set of options proposed for training that never got full support. That may be different though because training has so many variables.
  • we are going to want a new CLI option (or options) for adding custom pipeline configuration that can be tagged on to the end of the chosen pipeline. This is the most significant area of UX impact to very quickly sort out.
  • we don't have any CI environment capable of testing this. It's probably about time we figure that out.
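A minimal sketch of what the pipeline-selection option could look like, using stdlib argparse for illustration (the real ilab CLI is built on click, and the option name and choices are still up for debate):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ilab data generate")
    # Hypothetical option sketch for choosing the SDG pipeline.
    parser.add_argument(
        "--pipeline",
        choices=["simple", "full"],
        default="simple",
        help="SDG pipeline to run; 'full' needs a large model such as "
             "Mixtral-8x7b (default: %(default)s).",
    )
    return parser
```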

markmc and others added 6 commits June 28, 2024 05:16
@russellb russellb force-pushed the new-cli-integration branch from e931ed8 to 7e932d0 Compare June 28, 2024 19:14
@russellb
Member Author

saving off the "how to test" instructions here before I redo the PR description now that it's almost ready for merge

How to Test

(assumes you have the gh CLI installed and configured) -- https://cli.github.com/

# Set up a venv to work in
mkdir workspace
cd workspace
python -m venv venv
. venv/bin/activate

# Install the `ilab` CLI from main
git clone https://github.com/instructlab/instructlab
cd instructlab
pip install -e .
cd ..

# Install this PR version of instructlab-sdg
git clone https://github.com/instructlab/sdg
cd sdg
gh pr checkout 46
pip install -e .
cd ..

# Configure ilab and put a test taxonomy addition in place
# allow it to check out taxonomy for you
ilab config init
cd taxonomy
git remote add russellb https://github.com/russellb/taxonomy.git
git fetch russellb
git checkout russellb/softball
cd ..

# Download merlinite
ilab download

# Optional - serve the model in another terminal; helpful if you get server-side errors and want all the logs
cd workspace
. venv/bin/activate
ilab model serve

# and now you can test sdg
ilab data generate

# Find the results in the generated/ directory
ls generated/
cat generated/generated_*

@russellb russellb changed the title CLI integration with new SDG interfaces Initial CLI integration with new SDG interfaces Jun 28, 2024
@russellb russellb marked this pull request as ready for review June 28, 2024 19:24
russellb and others added 2 commits June 28, 2024 15:29
@russellb russellb force-pushed the new-cli-integration branch from 7e932d0 to 54b065a Compare June 28, 2024 19:29
@russellb
Member Author

Most of the CI jobs aren't running right now. GitHub is having problems: https://www.githubstatus.com/

@markmc
Contributor

markmc commented Jun 28, 2024

FWIW, +1 from me if we're not aware of it causing significant regressions

@russellb
Member Author

FWIW, +1 from me if we're not aware of it causing significant regressions

Thanks! As long as e2e still passes, I'm good with this going in and then we continue to iterate from here. It won't actually get picked up by anyone else until we tag a library release, anyway.

Contributor

@oindrillac oindrillac left a comment


Tested, +1 to get this merged and do follow-ups on this.

@russellb
Member Author

I'm going to merge this, but won't cut a release just yet. I want to look at knowledge document chunking in particular. I don't think I fully restored the previous behavior on that.

@russellb russellb merged commit 1f71fb6 into instructlab:main Jun 28, 2024
11 checks passed
@russellb russellb added this to the 0.1.0 milestone Jul 8, 2024
jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024