RAG ingestion and chat pipelines #161
base: main
Conversation
docs/cli/ilab-rag-retrieval.md
Outdated
> allowed. Therefore, we also propose alternative approaches to run the same RAG pipelines using existing `ilab` commands or
> other provided tools.
>
> ### 3.1 RAG Ingestion Pipeline Command
How will this be handled for fine-tuning?
I would expect that many of the components end up being re-invented here.
Are you asking about fine-tuning of the embedding model or the response-generation model or both?
Both.
Here is the plan regarding fine-tuning of the response-generation model:
- There are no plans to make changes to the existing capability in InstructLab for synthetic data generation (SDG) and fine-tuning the response-generation model from that synthetic data.
- That existing capability includes a preprocessing step that is part of the `ilab data generate` command, which fetches source documents (e.g., PDF files) and processes them using docling.
- In RAG ingestion and chat pipelines #161 we propose to separate that preprocessing into its own step.
- The outputs of that step will be used as inputs for the capabilities in RAG for vectorizing and indexing that same content (the source documents).
- Ideally there will also be some way to put documents in directly without having to run the SDG preprocessing, but that is lower priority than just getting the primary flow working.
Fine-tuning the embedding model is out of scope for the MVP, but in the future I think we expect that the outputs of SDG would also be useful as training data for an embedding model (e.g., a cross-encoder model that really needs query / response pairs for fine-tuning). Alternatively, maybe we just use the extracted text for fine-tuning a basic single-text encoder.
docs/cli/ilab-rag-retrieval.md
Outdated
> | **TODO** evaluation framework options | | | | |
>
> Equivalent YAML document for the newly proposed options:
This looks like it could pretty easily be structured in Feast.
This document is a good start, but needs a lot more input from a lot of stakeholders, especially stakeholders who work on the existing command-line interface.
docs/cli/ilab-rag-retrieval.md
Outdated
> * Internal Red Hat CI systems for products or services (e.g., Lightspeed products)
>
> ## 3. Proposed Commands
> **Note**: In the context of version 1.4, currently under development, no changes to the command-line interface should be
Can you remove the references to "version 1.4"? This is a version number for a downstream consumer of this open source project so it doesn't belong in a dev-doc for the open source project. If you want to discuss downstream consumers, there are other venues for that.
Also, I think the rest of this note should be dropped too. It matches what I originally believed was a hard constraint, but now I am hearing that this constraint is being reconsidered, and also that there is a hard constraint that we not provide alternative approaches to run the same RAG pipelines. So really we need more discussion to find out which constraints are really hard and which are not.
docs/cli/ilab-rag-retrieval.md
Outdated
> ### 3.1 RAG Ingestion Pipeline Command
> The proposal is to add a `rag` subgroup under the `data` group, with an `ingest` command, like:
> ```
> ilab data rag ingest /path/to/docs/folder
> ```
Some thoughts on this:
- I guess broadly speaking, I was expecting the proposal for how this should be reflected in the command-line interface to come from members of the engine team, e.g., @cdoern . However, I guess it is fine for us to propose things here and iterate with them.
- I don't like `rag ingest` here. I think we want something that describes what we're doing here, which is building an index, rather than bringing in the term "RAG", which describes the feature but not really what this specific step is doing.
- I'm not sure how to respond to the `/path/to/docs/folder` part. We definitely want some sort of affordance around a flow where you do an `ilab data generate` and then `ilab` just knows where the outputs of that step are rather than you needing to specify it. However, some other affordance for being able to override that location also makes sense to me. So maybe if the folder is optional that solves this?
- We need to figure out how this fits in with the broader refactor being considered in Refactor preprocessing and postprocessing in SDG #155.
- Also, I would like a flow some day where you can just point this step at source documents and it runs docling for you, but that's lower priority than the flow that is more tightly connected with SDG (or at least SDG preprocessing).
> I don't like `rag ingest` here

Me neither, but I was waiting for the closure of the discussion on the related command at Knowledge doc ingestion #148 that, IIUC, should be the preliminary step before running the embedding ingestion. Depending on the selected verb, we can update this proposal accordingly (maybe something like `ilab data index` or `ilab data generate index`?).

> I'm not sure how to respond to the `/path/to/docs/folder` part

Again, this followed the proposal for the other PR, which has both `--input` and `--output` options.

> We definitely want some sort of affordance around a flow where you do an `ilab data generate` and then `ilab` just knows where the outputs of that step are rather than you needing to specify it

If this is a valid use case, then yes, and the parameter will be optional. We have to think carefully about how to auto-detect the JSON docs in this case, as the `datasets` folder is "versioned" for each `data generate` execution, so I assume the requirement is to pick all the files from the latest `documents-*` subfolder.
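A minimal sketch of that auto-detection, assuming the `documents-*` naming described above and using modification time to pick the latest run (the mtime heuristic is an assumption of this sketch, not something decided in this thread):

```python
from pathlib import Path
from typing import Optional

def latest_documents_dir(datasets_root: str) -> Optional[Path]:
    """Return the most recently modified documents-* subfolder, or None.

    Assumption: the newest `data generate` run is the subfolder with the
    latest modification time in the versioned `datasets` folder.
    """
    candidates = [p for p in Path(datasets_root).glob("documents-*") if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```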
docs/cli/ilab-rag-retrieval.md
Outdated
> add a reference document to the `qna.yaml` document(s).
>
> #### Supported Databases
> The command supports multiple vector database types. By default, it uses a local `MilvusLite` instance stored at `./rag-output.db`.
I think we should have a separate dev-doc for this decision.
I think I need to clarify my position when I commented with a general "looks good to me". I think that you raise some very valid points, but also I am operating under the assumption that this document serves as a proposal for "directionally where to head right now" and that any less-than-high-level details can, will, and probably should change as we understand the problem domain better during execution.

I think that a useful modus operandi is to get a few key stakeholders to give a general approval, and that that is enough to get started, and then have continuous feedback cycles all the time going forward to course correct as necessary. Analysis paralysis is a real effect that is best avoided. Not trying to beat a dead horse, but this is partly why I keep advocating for atomic decision records like ADRs over all-encompassing design docs like this. A general development roadmap is a necessary thing to have, but nobody will ever have enough information to design a full system specification, especially in the context of a marketplace and a large development organization. The only constant is change.

I will soon publish an updated version with the outcome of the discussion with the ilab Runtime (aka CLI) team.

@cdoern Could you please TAL and involve relevant people?
docs/cli/ilab-rag-retrieval.md
Outdated
> transformation, leveraging on the `instructlab-sdg` modules.
>
> ### Why We Need It
> This command streamlines the `ilab data generate` pipeline and eliminates the requirement to define a `qna` document,
This is a really far-reaching design decision that could have a lot of consequences for the product. Looks like this came out of a meeting with the engine runtime team, was that recorded?
recording link shared on Slack channel
docs/cli/ilab-rag-retrieval.md
Outdated
> InstructLab technology stack.
>
> #### Usage
> The generated embeddings can later be retrieved to enrich the context for RAG-based chat pipelines.
The embeddings themselves, not text to be substituted into a prompt template?
docs/cli/ilab-rag-retrieval.md
Outdated
> | Option Description | Default Value | CLI Flag | Environment Variable |
> |--------------------|---------------|----------|----------------------|
> | Whether to include a transformation step. | `False` | `--transform` (boolean) | `ILAB_TRANSFORM` |
What would some examples of transformations be?
- `ilab data process --rag input` runs the embedding pipeline: it fetches pre-processed docs from the `input` folder and stores the generated embeddings in the configured vector store.
- `ilab data process --rag --transform --transform-output processed input` runs the transformation pipeline first: it fetches user docs from the `input` folder, processes them with the SDG transformation into the `processed` folder, then runs the embedding pipeline from that folder.
docs/cli/ilab-rag-retrieval.md
Outdated
> |--------------------|---------------|----------|----------------------|
> | Whether to include a transformation step. | `False` | `--transform` (boolean) | `ILAB_TRANSFORM` |
> | The output path of transformed documents (serve as input for the embedding ingestion pipeline). Mandatory when `--transform` is used. | | `--transform-output` | `ILAB_TRANSFORM_OUTPUT` |
> | How to split the documents. One of `page`, `passage`, `sentence`, `word`, `line` | `word` | `--splitter-split-by` | `ILAB_SPLITTER_SPLIT_BY` |
Has there been discussion about the existence of this logic with respect to the docling-based document transformation?
`docling` chunkers haven't yet been integrated into Haystack. Instead, we took the DocumentSplitter options from Haystack.

I agree that in the interim we should try not to introduce framework dependencies, so I'd remove them and use default settings for now. In the meantime, we can start exploring the docling chunkers. WDYT?
You may have seen on Slack that the Docling hybrid chunker is now released. I looked at the code briefly, and it looks good to me. More details:
I think this will be a good fit for the RAG chunking because (unlike their older hierarchical chunker), it provides chunks that are constrained to be no bigger than a fixed size for a given tokenizer and tries to make the chunks as big as possible within that size limit and the constraints of the structure.
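This is not Docling's implementation, but the core idea described here (greedily pack structural units into chunks that stay within a fixed token budget) can be sketched in a few lines; whitespace splitting stands in for a real tokenizer:

```python
def pack_chunks(units, max_tokens):
    """Greedily merge text units into chunks of at most max_tokens tokens.

    `units` are structural pieces of a document (e.g., paragraphs); token
    counting here is plain whitespace splitting, standing in for the model
    tokenizer a real chunker would use.
    """
    chunks, current, current_len = [], [], 0
    for unit in units:
        n = len(unit.split())
        # Flush the current chunk when adding this unit would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(unit)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that in this sketch a single unit larger than the budget still becomes its own oversized chunk; a real hybrid chunker would split it further.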
I will sync up with ET engineers to adopt the same approach once available.
docs/cli/ilab-rag-retrieval.md
Outdated
> | Vector DB connection username. | | `--vectordb-username` | `ILAB_VECTORDB_USERNAME` |
> | Vector DB connection password. | | `--vectordb-password` | `ILAB_VECTORDB_PASSWORD` |
> | Name of the embedding model. | `sentence-transformers/all-minilm-l6-v2` | `--model` | `ILAB_EMBEDDING_MODEL_NAME` |
> | Token to download private models. | | `--model-token` | `ILAB_EMBEDDING_MODEL_TOKEN` |
This would introduce model downloading logic in a new place while `ilab model download` already exists.
Ah, good point! I hadn't thought of that. Using the existing model download sounds better to me.
Sounds good to me.

You mean that the user should first `ilab model download -rp sentence-transformers/all-minilm-l6-v2` and the RAG pipelines (both) would validate that a local download exists for that model before proceeding?

Or would the RAG pipelines use the ilab download function to download the model locally?
Perhaps the most convenient and flexible solution would be to package in a default model but also allow downloading and configuring the use of a different model by doing `ilab model download ...` and configuring it to be used.
So it's a user responsibility to download it first 👍

I'll add a note and also introduce a `--model-dir` option to supply a configurable location for looking for downloaded models.
docs/cli/ilab-rag-retrieval.md
Outdated
> | Minimum number of units per split. | `0` | `--splitter-split-threshold` | `ILAB_SPLITTER_SPLIT_THRESHOLD` |
> | Vector DB implementation, one of: `milvuslite`, **TBD** | `milvuslite` | `--vectordb-type` | `ILAB_VECTORDB_TYPE` |
> | Vector DB service URI. | `./rag-output.db` | `--vectordb-uri` | `ILAB_VECTORDB_URI` |
> | Vector DB connection token. | | `--vectordb-token` | `ILAB_VECTORDB_TOKEN` |
How does this differ from username/password?
You are right: I took `token` from other document store examples in Haystack, but I agree it's better to focus on Milvus only for now.

WDYT about dropping the authentication part for now and then reviewing the decision once we define the supported stores and verify the available authentication methods?

Milvus seems to offer authentication via username and password, but other stores have a different authn method, or no authn at all (e.g. Chroma).

If we want to be more generic, what about a single `--vectordb-authentication` option where we can put a comma-separated list of the store-specific settings? E.g., for Milvus it would be: `ilab data process --rag --vectordb-type milvus --vectordb-uri 'http://localhost:1234' --vectordb-authentication 'username=$MILVUS_USER,password=$MILVUS_PASSWORD'`
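A sketch of how such a flag value could be parsed; the flag itself is only proposed in this thread, not implemented, and the function name here is hypothetical:

```python
def parse_authentication(value: str) -> dict:
    """Parse a comma-separated list of key=value settings,
    e.g. 'username=alice,password=s3cret' -> {'username': 'alice', 'password': 's3cret'}.
    """
    settings = {}
    for item in value.split(","):
        if not item:
            continue  # tolerate empty input / trailing commas
        key, sep, val = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        settings[key.strip()] = val
    return settings
```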
> WDYT about dropping the authentication part for now and then review the decision once we define the supported stores and verify the available authentication methods?

Shipping something minimal that works and expanding on it in a feedback cycle sounds good to me!
docs/cli/ilab-rag-retrieval.md
Outdated
> | Maximum number of units in each split. | `200` | `--splitter-split-length` | `ILAB_SPLITTER_SPLIT_LENGTH` |
> | Number of overlapping units for each split. | `0` | `--splitter-split-overlap` | `ILAB_SPLITTER_SPLIT_OVERLAP` |
> | Minimum number of units per split. | `0` | `--splitter-split-threshold` | `ILAB_SPLITTER_SPLIT_THRESHOLD` |
> | Vector DB implementation, one of: `milvuslite`, **TBD** | `milvuslite` | `--vectordb-type` | `ILAB_VECTORDB_TYPE` |
Many (all?) databases allow multiple document collections. That should probably be a parameter as well.
Adding one more option:

| Vector DB collection name. | `IlabEmbeddings` | `--vectordb-collection-name` | `ILAB_VECTORDB_COLLECTION_NAME` |
docs/cli/ilab-rag-retrieval.md
Outdated
> |-------------------|-------------|---------------|----------|----------------------|
> | chat.rag.enabled | Enable or disable the RAG pipeline. | `false` | `--rag` (boolean) | `ILAB_CHAT_RAG_ENABLED` |
> | chat.rag.retriever.top_k | The maximum number of documents to retrieve. | `10` | `--retriever-top-k` | `ILAB_CHAT_RAG_RETRIEVER_TOP_K` |
> | chat.rag.prompt | Prompt template for RAG-based queries. | Examples below | `--rag-prompt` | `ILAB_CHAT_RAG_PROMPT` |
If there is a way to unify prompt templates throughout InstructLab into one place rather than adding another place to store them, that would probably be ideal.
- Would this deserve its own ADR?
- Is it something we should let the user configure, or are you just thinking of where to place a hardcoded prompt?
Yes, I will add this to the list of planned ADRs.
> If there is a way to unify prompt templates throughout InstructLab into one place rather than adding another place to store them, that would probably be ideal.

+1
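For illustration only, a template of the kind `chat.rag.prompt` could hold might look like the sketch below; the placeholder names and wording are assumptions, not an actual InstructLab template:

```python
# Hypothetical RAG prompt template; {context} and {question} are the
# substitution points a retriever-backed chat pipeline would fill in.
RAG_PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_rag_prompt(question, retrieved_docs):
    # Join the top-k retrieved chunks into one context block for the template.
    context = "\n---\n".join(retrieved_docs)
    return RAG_PROMPT_TEMPLATE.format(context=context, question=question)
```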
There are many design decisions being made here that appear to be made in a bit of a vacuum, and so they increase the complexity of product usage and configuration while there are opportunities to streamline it instead. @jwm4 I think we need to dedicate a significant effort to work through these as a group. I left comments on what I saw in a first pass.
This is starting to look good to me. I still have some minor disagreements about technical details (see comments below) but mostly this is feeling like it is on the right track.
docs/cli/ilab-rag-retrieval.md
Outdated
> | Vector DB connection token. | | `--vectordb-token` | `ILAB_VECTORDB_TOKEN` |
> | Vector DB connection username. | | `--vectordb-username` | `ILAB_VECTORDB_USERNAME` |
> | Vector DB connection password. | | `--vectordb-password` | `ILAB_VECTORDB_PASSWORD` |
> | Name of the embedding model. | `sentence-transformers/all-minilm-l6-v2` | `--model` | `ILAB_EMBEDDING_MODEL_NAME` |
I am planning to do a separate ADR for the default embedding model. For the purpose of this document, would it be OK to just replace `sentence-transformers/all-minilm-l6-v2` with TBD?
docs/cli/ilab-rag-retrieval.md
Outdated
> | How to split the documents. One of `page`, `passage`, `sentence`, `word`, `line` | `word` | `--splitter-split-by` | `ILAB_SPLITTER_SPLIT_BY` |
> | Maximum number of units in each split. | `200` | `--splitter-split-length` | `ILAB_SPLITTER_SPLIT_LENGTH` |
> | Number of overlapping units for each split. | `0` | `--splitter-split-overlap` | `ILAB_SPLITTER_SPLIT_OVERLAP` |
> | Minimum number of units per split. | `0` | `--splitter-split-threshold` | `ILAB_SPLITTER_SPLIT_THRESHOLD` |
I am not sure all these splitter options make sense in the context of the Docling hierarchical splitting capability. Also, regardless of the underlying technology, the underlying embedding models only allow a certain number of tokens to encode. So if you let the users split on chunks of 2 pages (for example), what do we do when we need to create the vectors? Just take the first `K` tokens of each chunk? It feels like we're giving users too much freedom to do things that don't make sense here without also making it clear what the consequences of doing so would be. We should discuss this topic more.
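To make the consequence concrete: anything past the embedding model's token limit simply never reaches the vector. A toy sketch, with whitespace tokens standing in for a real tokenizer:

```python
def truncate_to_model_limit(chunk: str, max_tokens: int) -> str:
    """What an embedder effectively does with an oversized chunk:
    everything past max_tokens is silently dropped before encoding."""
    tokens = chunk.split()
    return " ".join(tokens[:max_tokens])
```

So a 2-page chunk embedded with a small-context model would be represented only by its opening tokens, which is the silent data loss being flagged here.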
As replied before, while we wait for integrating the docling chunkers, we can drop these settings and use some opinionated defaults for now.

The other question that @ilan-pinto raised around this topic is whether we really need any chunking at all, since the SDG formatting already chunks the original user documents. Should we review this step?
docs/cli/ilab-rag-retrieval.md
Outdated
> ### 2.7 References
> * [Haystack-DocumentSplitter](https://github.com/deepset-ai/haystack/blob/f0c3692cf2a86c69de8738d53af925500e8a5126/haystack/components/preprocessors/document_splitter.py#L55)
I think probably the Haystack splitter will wind up getting dropped from the solution in favor of something Docling-based.
Adding a note that this is a temporary (non configurable) option.
Overall I have some concerns about this approach, especially in light of the current changes happening in SDG. I think a lot of this approach is based on where SDG was and not where SDG is going, but this work wouldn't land in SDG until after we've reconciled with the research changes, have the ability to create custom Pipeline Blocks, expect users to create and execute their own Pipelines, and split out data preprocessing from data generation from data postprocessing.
I think the entire approach to generating vector embeddings and populating those in a vector database could probably be handled with the existing (post-reconcile with Research fork) SDG code along with a custom Pipeline Block implementation or two. We don't document how to do this yet, as the code is just landing, but that's our designated extension mechanism to do any random thing you want during a data generation pipeline.
docs/cli/ilab-rag-retrieval.md
Outdated
> The rationale behind this choice is that the `data process` command can support future workflows, making its
> introduction an investment to anticipate other needs.
>
> Since the RAG behavior is the only functionality of this new command, executions without the `--rag` option will result
Just a note that `ilab data process` does not exist yet, so I'd be careful about designing things that layer on top of it until we see when/if that gets implemented in its current form.
docs/cli/ilab-rag-retrieval.md
Outdated
> #### Assumptions
> The provided documents must be in JSON format according to the InstructLab schema: this is the schema generated
> when transforming knowledge documents with the `ilab data generate` command (see
`ilab data generate` does not output documents in an InstructLab schema, at least not as referenced here. Even once we separate out preprocessing from generation from postprocessing in `ilab data` commands, we may keep `ilab data generate` as it is today for backwards compatibility. I don't think that's decided yet.
- Agree on the "InstructLab schema" comment, but these pre-processed artifacts are identified in this way in William's document "WC - RAG Artifacts with RHEL AI (PM perspective)" (I can share a link in DM if needed)
- Of course `ilab data generate` can remain as it is today; this is outside the purpose of this design document
docs/cli/ilab-rag-retrieval.md
Outdated
> transformation, leveraging on the `instructlab-sdg` modules.
>
> ### Why We Need It
> This command streamlines the `ilab data generate` pipeline and eliminates the requirement to define a `qna` document,
I'm not aware of any intention to remove the need for qna.yaml from `ilab data generate`. My latest understanding is that we hope to end up with separate commands for preprocessing qna.yaml into data samples, running a generation pipeline, and post-processing generated results into final mixed datasets. `ilab data generate` encompasses all 3 of those stages today, and may continue to. However, we do plan to have some step that starts at input data samples and runs generation pipelines, without a qna.yaml required. It will likely not be called `ilab data generate`, but that's still undecided.
Agree, and there was no intention to remove the need for qna.yaml from `ilab data generate`.
docs/cli/ilab-rag-retrieval.md
Outdated
> ### 2.3 Embedding Ingestion Pipeline Options
I wonder if instead of wiring special support for all this into ilab, we should just consider generating and inserting vector embeddings stages in a data generation pipeline? We're moving to a model where users can supply their own custom pipeline and create their own custom pipeline blocks. So, we should ship (or the user could define) a RAG pipeline that handles turning the data samples into embeddings and storing them in a vector database all as part of our existing pipeline flows, without any code changes in SDG itself?
Is there any design document that you can share about this initiative?
> What would the purpose of indexing generated data be?

I don't mean indexing generated data - I mean using our pipelines concept to run a RAG pipeline that generates embeddings, populates a vector db, whatever you need - as opposed to calling an LLM for inference and data generation. Pipelines take an input dataset and have a sequence of Blocks that get executed in steps: the first block gets each input sample as input, transforms those samples in some way, and outputs samples, and the next block gets those new samples as its input. Today we mostly use this for transforming data in datasets, building prompts, and calling LLMs for inference, but you could also use this concept to tokenize text and insert it into a vector db. A RAG pipeline just becomes another set of pipelines shipped with the product, versus code custom and specific to the RAG use-case, other than perhaps some RAG-specific Blocks we'd like to ship in the product itself. It may be hard to understand how this all works without understanding the code of SDG, including the upcoming changes to it, but we should at least try to use the designed SDG extension points of custom Blocks for part of this, I think.
Ah, you mean creating a new pipeline for this that has nothing to do with SDG. That sounds like it might be a very flexible solution, but at the cost of understandability etc.
I very much agree with this conclusion. It also has extraordinary consequences for our enterprise customers at the RHOAI scale.
@jwm4 @anastasds integrated changes from yesterday's meeting. Should we move it from Draft to Ready?

@jwm4 @anastasds I added a "User Experience Overview" section with diagrams to clarify the workflows we discussed in the last review meeting. Please share your feedback.

FYI, @jwm4 @anastasds I added a "Design Considerations" section to outline key design requirements.
docs/cli/ilab-rag-retrieval.md
Outdated
> | Name of the embedding model. | **TBD** | `--embedding-model` | `ILAB_EMBEDDING_MODEL_NAME` |
>
> ### 2.6 RAG Chat Pipeline Command
> The proposal is to add a `--rag` flag to the `model chat` command, like:
Since I asked the clarifying question in a meeting, let's clearly state that `--rag` is for overriding the file-based configuration, and that if the config file has RAG enabled, the user does not need to pass this flag.
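The precedence being asked for could be sketched roughly as follows. This is an illustration only; the function name is made up, and only the `chat.rag.enabled` config key is taken from the discussion:

```python
def rag_enabled(cli_flag=None, config=None):
    """Resolve whether RAG is on: an explicit --rag flag overrides the config
    file; otherwise fall back to chat.rag.enabled (default False). Sketch only,
    not actual ilab internals."""
    if cli_flag is not None:
        return bool(cli_flag)
    return bool((config or {}).get("chat", {}).get("rag", {}).get("enabled", False))

# With RAG enabled in the config file, no flag is needed:
rag_enabled(config={"chat": {"rag": {"enabled": True}}})
```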
docs/cli/ilab-rag-retrieval.md
Outdated
The proposal is to add an `ingest` sub-command to the `data` command group:
```
ilab data ingest /path/to/docs/folder
```
It might be better to have a default folder path? Or should the user specify one?
This part has just been updated. So we can have 2 ways to run ingestion:

- from documents processed by `ilab data generate`, e.g. from the latest `.../datasets/documents-ABC/docling-artifacts` folder. In this case there is no need to specify any path.
- from user documents processed with `ilab data process` with no taxonomy definitions. In this case the path is specified by the user in both commands.
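The two entry points above could resolve their input roughly like this. A sketch only: the helper name is made up, and the `datasets/documents-*/docling-artifacts` layout is assumed from the example above:

```python
from pathlib import Path

def resolve_ingestion_input(user_path=None, datasets_root="datasets"):
    """Prefer an explicit user path; otherwise fall back to the most recently
    modified docling-artifacts folder produced by document processing.
    Hypothetical helper; the real ilab implementation may differ."""
    if user_path is not None:
        return Path(user_path)
    candidates = sorted(Path(datasets_root).glob("documents-*/docling-artifacts"),
                        key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError("no processed documents found; run processing first")
    return candidates[-1]  # newest run wins
```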
docs/cli/ilab-rag-retrieval.md
Outdated
### 1.2 Model Training Path
This flow is designed for users who aim to train their own models and leverage the source documents that support knowledge submissions to enhance the chat context:
![model-training](../images/rag-model-training.png) |
I like these images as-is, but they do appear to violate the official dev-docs guidelines on images as specified here. My inclination is to leave them as is since they look good, but if the oversight committee decides to be very strict about this guideline then we might need to redo them.
Images were actually generated with Excalidraw, one of the recommended tools, and I also added a link to edit them for maintenance and sharing purposes: it's in another section below this paragraph; I can move it up if needed.
Could you clarify what violation you see in the guidelines? (I had already read them before; that's why I used this tool, BTW)
@jwm4 @anastasds please review the "Options to Rebuild Excalidraw Diagrams:" section at the top of the document.
Force-pushed from 9f0d0bc to f292b33.
Minor spelling issues
Force-pushed from 3d7757f to bfbcef1.
This is mostly looking good to me. I am requesting a few minor changes.
Force-pushed from 3956a88 to ce2ac7f.
docs/rag/ilab-rag-retrieval.md
Outdated
| Option Description | Default Value | CLI Flag | Environment Variable |
|--------------------|---------------|----------|----------------------|
| Location folder of user documents. In case it's missing, the taxonomy is navigated to look for updated knowledge documents. | | `--input` | `ILAB_PROCESS_INPUT` |
| Location folder of processed documents. | | `--output` | `ILAB_PROCESS_OUTPUT` |
@anastasds , should ILAB_PROCESS_INPUT and ILAB_PROCESS_OUTPUT be ILAB_CONVERT_INPUT and ILAB_CONVERT_OUTPUT, since process was renamed to convert?
Yes, missed that, thanks - submitted fix in dmartinol#4
Merged, thanks!
A few questions/comments but overall LGTM - won't block on anything I've stated here
docs/rag/ilab-rag-retrieval.md
Outdated
(RAG) artifacts within `InstructLab`. The proposed changes introduce new commands and options for the embedding ingestion
and RAG-based chat pipelines:

* A new `ilab rag` command group, feature gated behind a `ILAB_DEV_PREVIEW` environment variable. |
What's the point of the feature gate?
@cdoern and @bbrowning were concerned that if this were released as dev preview without a gate, it would set expectations that this command would continue to exist, but in reality we haven't really converged on a long-term CLI for this functionality.
Gotcha - personally I would simply prefer some kind of user alert (i.e. something like "NOTE: This is an experimental command at this time - once fully supported this warning will go away) over a env var-based feature gate - but if there was a previous convo about this I won't muck up the works, I don't feel that strongly about it 😄
Honestly, going in and trying to implement this, it seems a lot simpler to put a warning on it. Especially since some of the new options are for existing command groups, e.g. `chat.rag.enabled`. And don't we want users trying out previews?
I would also prefer a warning, but @cdoern and @bbrowning seemed pretty firm in their insistence on a feature gate.
After an offline discussion yesterday, where we landed is that experimental options should simply cause the application to exit with an error message when used without the dev flag set.
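That agreed behavior could look roughly like the sketch below. The function name is made up, and the variable name `ILAB_EXPERIMENTAL_ENABLE` is just one of the spellings floated in this thread, not a settled decision:

```python
import os
import sys

def require_experimental(feature, env_var="ILAB_EXPERIMENTAL_ENABLE"):
    """Exit with an error when an experimental option is used without the
    dev flag being set. Hypothetical helper; names are assumptions."""
    if os.environ.get(env_var, "").strip().lower() not in ("1", "true"):
        sys.exit(f"error: '{feature}' is experimental; set {env_var}=true to enable it")

# Example: guard an experimental sub-command before running it.
os.environ["ILAB_EXPERIMENTAL_ENABLE"] = "true"
require_experimental("ilab rag convert")  # no-op when the flag is set
```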
docs/rag/ilab-rag-retrieval.md
Outdated
and RAG-based chat pipelines:

* A new `ilab rag` command group, feature gated behind a `ILAB_DEV_PREVIEW` environment variable.
* A new `ilab rag` sub-command group to process customer documentation. |
* A new `ilab rag` sub-command group to process customer documentation.
* A new `ilab rag` sub-command group to process user documentation. |

### 2.2 Document Processing Pipeline

The proposal is to add a `convert` sub-command to the `rag` command group. |
I will note there is an existing `ilab model convert` command - I think we can still use `convert` here, but if another term works for this purpose that may be simpler for users.
I think using convert is fine since that is the point of the nested cmd structure
@dmartinol Please squash commits before this is merged, TIA |
Signed-off-by: Daniele Martinoli <[email protected]> Signed-off-by: Anastas Stoyanovsky <[email protected]>
Force-pushed from a9184fe to 109a8b9.
generally looks good! just a few comments on env var names and values.
docs/rag/ilab-rag-retrieval.md
Outdated
(RAG) artifacts within `InstructLab`. The proposed changes introduce new commands and options for the embedding ingestion
and RAG-based chat pipelines:

* A new `ilab rag` command group, feature gated behind a `ILAB_DEV_PREVIEW` environment variable. |
note, this was decided to be named `ILAB_EXPERIMENTAL_ENABLE` or something like that rather than dev preview
docs/rag/ilab-rag-retrieval.md
Outdated

**Note**: documents are processed using `docling.DocumentConverter` and are defined using the docling v2 schema.

### 1.4 Plug-and-Play RAG Path |
is this the "main" path? if so should we mark it as such?
I will move it to the top of the list, let me know if this is enough. IMO this seems the "main" use case for demo purposes, but in the long run the Model Training one may take priority for production cases (together with a fine-tuned embedding model, probably).
Yes, I agree with Daniele -- this is the basic case for someone doing a simple demo, and listing it first seems reasonable. However, it seems possible that doing both model training and RAG together will wind up being more important for business use cases. I guess an open question is whether users will ultimately be happy with the assumption that the same documents used for knowledge training are used for RAG -- I can see reasons why some users might want to control those separately. For the initial dev preview, though, I think that assumption is fine.
docs/rag/ilab-rag-retrieval.md
Outdated

### 2.3 Document Processing Pipeline Options

**Note**: The `--help` option will be aware of the `rag` command group only if `ILAB_DEV_PREVIEW` environment variable is set to `true`. |
should the env be 0/1?
If that's the pattern we usually use, then sure.
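Whichever convention wins, a tolerant parser can accept both `0`/`1` and `true`/`false` spellings uniformly. A sketch, not existing ilab code:

```python
def env_flag(value, default=False):
    """Interpret common boolean environment-variable spellings
    ('1'/'0', 'true'/'false', 'yes'/'no', case-insensitive).
    Hypothetical helper, not ilab's actual implementation."""
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")
```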
docs/rag/ilab-rag-retrieval.md
Outdated

### 2.5 Embedding Ingestion Pipeline Options

**Note**: The `--help` option will be aware of the `rag` command group only if `ILAB_DEV_PREVIEW` environment variable is set to `true`. |
note the env variable name here as well
**Note**: The `--help` option will be aware of the `rag` command group only if `ILAB_DEV_PREVIEW` environment variable is set to `true`.
**Note**: The `--help` option will be aware of the `rag` command group only if `ILAB_ENABLE_EXPERIMENTAL` environment variable is set to `true`. |
Signed-off-by: Daniele Martinoli <[email protected]>
Signed-off-by: Daniele Martinoli <[email protected]>
The concerns I raised in my last review appear to be resolved, so I am happy for this to merge now.
Signed-off-by: Daniele Martinoli <[email protected]>
Introducing `ilab` command changes to support the RAG ingestion and chat pipelines: