Use the following documentation to learn about the NVIDIA RAG Blueprint.
- Overview
- Target Audience
- Software Components
- Technical Diagram
- Hardware Requirements
- Next Steps
- Available Customizations
- Inviting the community to contribute
- License
This blueprint serves as a reference solution for a foundational Retrieval Augmented Generation (RAG) pipeline. One of the key use cases in Generative AI is enabling users to ask questions and receive answers based on their enterprise data corpus. This blueprint demonstrates how to set up a RAG solution that uses NVIDIA NIM and GPU-accelerated components. By default, this blueprint leverages the NVIDIA-hosted models available in the NVIDIA API Catalog. However, you can replace these models with your own locally-deployed NVIDIA NIM microservices to meet specific data governance and latency requirements.
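By way of illustration only: in a LangChain-based client such as this blueprint's orchestrator, switching from the NVIDIA-hosted models to a locally deployed NIM typically amounts to pointing the client at a different endpoint. The endpoint URL and model name below are assumed values for the sketch, not the blueprint's actual configuration, which is driven by its own environment variables and Helm values.

```python
# Illustration of the hosted-vs-local choice described above, using the
# langchain-nvidia-ai-endpoints package. The URL and model name are assumed
# values for this sketch, not the blueprint's actual configuration.
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Default: NVIDIA-hosted model from the NVIDIA API Catalog (requires NVIDIA_API_KEY).
hosted_llm = ChatNVIDIA(model="meta/llama-3.1-70b-instruct")

# Alternative: a locally deployed LLM NIM microservice, addressed by its base URL.
local_llm = ChatNVIDIA(
    base_url="http://localhost:8000/v1",  # assumed local NIM endpoint
    model="meta/llama-3.1-70b-instruct",
)

print(local_llm.invoke("Summarize retrieval augmented generation in one sentence.").content)
```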
This blueprint is for:
- Developers: Developers who want a quick start to set up a RAG solution for unstructured data, with a path to production using NVIDIA NIM microservices.
The following are the default components included in this blueprint:
- NVIDIA NIM Microservices
  - Response Generation (Inference)
  - Retriever Models
- Orchestrator server - Langchain based
- Milvus Vector Database - accelerated with NVIDIA cuVS
- Text Splitter: Recursive Character Text Splitter (see the sketch after this list)
- Document parsers: Unstructured.io
- File Types: File types supported by unstructured.io. Accuracy is best optimized for files with the extensions `.pdf`, `.txt`, and `.md`.
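As a quick illustration of the default text splitter, the following is a minimal sketch using LangChain's `RecursiveCharacterTextSplitter`. The chunk size and overlap values shown here are illustrative assumptions, not the blueprint's configured defaults.

```python
# Minimal sketch of the default splitting strategy using LangChain's
# RecursiveCharacterTextSplitter. The chunk_size and chunk_overlap values
# are illustrative assumptions, not the blueprint's configured defaults.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # assumed value for illustration
    chunk_overlap=64,  # assumed value for illustration
)

with open("sample.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

print(f"Produced {len(chunks)} chunks")
```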
We provide Docker Compose scripts that deploy the microservices on a single node. When you are ready for a large-scale deployment, you can use the included Helm charts to deploy the necessary microservices. You can use the sample Jupyter notebooks with the JupyterLab service to interact with the code directly.
The Blueprint contains sample data from the NVIDIA Developer Blog. You can build on this blueprint by customizing the RAG application to your specific use case.
We also provide a sample user interface named `rag-playground`.
The image represents the high-level architecture and workflow. The core business logic is defined in the `rag_chain_with_multiturn()` method of the `chains.py` file. Here's a step-by-step explanation of the workflow from the end user's perspective:
- User Interaction via RAG Playground:
  - The user interacts with this blueprint by typing queries into the sample UI microservice named RAG Playground. These queries are sent to the system through the `POST /generate` API exposed by the RAG server microservice. Separate notebooks that showcase API usage are also available, and a minimal client sketch follows this list.
- Query Processing:
  - The query enters the RAG Server, which is based on LangChain. An optional Query Rewriter component may refine or decontextualize the query for better retrieval results.
- Retrieval of Relevant Documents:
  - The refined query is passed to the Retriever module. This component queries the Milvus Vector Database microservice, which stores embeddings of unstructured data generated using the NeMo Retriever Embedding microservice. The retriever module identifies the top 20 most relevant chunks of information related to the query.
- Reranking for Precision:
  - The top 20 chunks are passed to the optional NeMo Retriever reranking microservice. The reranker narrows down the results to the top 4 most relevant chunks, improving precision.
- Response Generation:
  - The top 4 chunks are injected into the prompt and sent to the Response Generation module, which leverages the NeMo LLM inference microservice to generate a natural language response based on the retrieved information. An illustrative LangChain sketch of this retrieve-rerank-generate flow appears at the end of this section.
- Delivery of Response:
  - The generated response is sent back to the RAG Playground, where the user can view the answer to their query as well as check the output of the retriever module using the `Show Context` option.
- Ingestion of Data:
  - Separately, unstructured data is ingested into the system via the `POST /documents` API using the `Knowledge Base` tab of the RAG Playground microservice. This data is preprocessed, split into chunks, and stored in the Milvus Vector Database using embeddings generated by models hosted by the NeMo Retriever Embedding microservice.
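For readers who want to exercise these endpoints without the RAG Playground, the following is a minimal client sketch. The `/documents` and `/generate` paths come from the workflow above; the server address and payload fields are illustrative assumptions, so consult the OpenAPI specification and the sample notebooks for the authoritative request and response schemas.

```python
# Hypothetical client for the RAG server's ingestion and generation endpoints.
# The /documents and /generate paths come from the workflow above; the server
# address and payload field names are assumptions -- see the OpenAPI
# specification and sample notebooks for the real schemas.
import requests

RAG_SERVER = "http://localhost:8081"  # assumed address of the RAG server microservice

# Ingest an unstructured document into the knowledge base (POST /documents).
with open("nvidia_developer_blog_post.pdf", "rb") as f:
    ingest = requests.post(f"{RAG_SERVER}/documents", files={"file": f})
ingest.raise_for_status()

# Ask a question against the ingested corpus (POST /generate).
answer = requests.post(
    f"{RAG_SERVER}/generate",
    json={"messages": [{"role": "user", "content": "What does cuVS accelerate?"}]},  # assumed schema
)
answer.raise_for_status()
print(answer.text)
```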
This modular design ensures efficient query processing, accurate retrieval of information, and easy customization.
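The internal flow described above (embed the query, retrieve the top 20 chunks, rerank down to the top 4, then generate) maps naturally onto a LangChain pipeline. The sketch below illustrates that pattern under stated assumptions; it is not the `rag_chain_with_multiturn()` implementation from `chains.py`, and the Milvus connection details, prompt, query rewriting, and multi-turn handling are all simplified away.

```python
# Illustrative LangChain pipeline mirroring the workflow above: retrieve the top
# 20 chunks from Milvus, rerank down to 4, and generate a response with an LLM
# NIM. This is NOT the blueprint's chains.py code; model names, connection
# details, and the prompt are assumptions. Requires NVIDIA_API_KEY (or local NIM
# endpoints) and a running Milvus instance.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_milvus import Milvus
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings, NVIDIARerank

embedder = NVIDIAEmbeddings(model="nvidia/llama-3.2-nv-embedqa-1b-v2")
vectorstore = Milvus(
    embedding_function=embedder,
    collection_name="rag_docs",                           # assumed collection name
    connection_args={"uri": "http://localhost:19530"},    # assumed Milvus endpoint
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})                   # top 20 chunks
reranker = NVIDIARerank(model="nvidia/llama-3.2-nv-rerankqa-1b-v2", top_n=4)    # keep top 4

llm = ChatNVIDIA(model="meta/llama-3.1-70b-instruct")
prompt = ChatPromptTemplate.from_messages(
    [("system", "Answer using only the provided context:\n{context}"), ("human", "{question}")]
)

def retrieve_and_rerank(question: str) -> str:
    """Retrieve candidate chunks, rerank them, and join the survivors into context text."""
    docs = reranker.compress_documents(retriever.invoke(question), query=question)
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retrieve_and_rerank, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("How does cuVS accelerate vector search?"))
```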
The following are the hardware requirements for each component. The reference code in the solution (glue code) is referred to as the "pipeline".
The overall hardware requirements depend on whether you Deploy With Docker Compose or Deploy With Helm Chart.
- GPU Driver - 530.30.02 or later
- CUDA version - 12.6 or later
The following NIM and hardware requirements only need to be met if you are self-hosting the NIM microservices. See Using self-hosted NVIDIA NIM microservices.
- 8XH100-80GB or 8XA100-80GB
- Pipeline operation: 1x L40 GPU or similar recommended. It is needed for the Milvus vector store database if you plan to enable GPU acceleration.
- (If locally deployed) LLM NIM: Meta Llama 3.1 70B Instruct Support Matrix
  - For improved parallel performance, we recommend 8x or more H100s for LLM inference.
  - The pipeline can share the GPU with the LLM NIM, but a separate GPU for the LLM NIM is recommended for optimal performance.
- (If locally deployed) Embedding NIM: Llama-3.2-NV-EmbedQA-1B-v2 Support Matrix
  - The pipeline can share the GPU with the Embedding NIM, but a separate GPU for the Embedding NIM is recommended for optimal performance.
- (If locally deployed) Reranking NIM: llama-3_2-nv-rerankqa-1b-v1 Support Matrix
- Do the procedures in Get Started to deploy this blueprint
- See the OpenAPI Specification
- Explore notebooks that demonstrate how to use the APIs here
The following are some of the customizations that you can make after you complete the steps in Get Started.
- Change the Inference or Embedding Model
- Customize Your Vector Database
- Customize Your Text Splitter
- Customize Prompts
- Customize LLM Parameters at Runtime
- Support Multi-Turn Conversations
We're posting these examples on GitHub to support the NVIDIA LLM community and facilitate feedback. We invite contributions! To open a GitHub issue or pull request, see the contributing guidelines.
This NVIDIA AI BLUEPRINT is licensed under the Apache License, Version 2.0. This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
The software and materials are governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the Product-Specific Terms for NVIDIA AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/), except that models are governed by the AI Foundation Models Community License Agreement (found at NVIDIA Agreements | Enterprise Software | NVIDIA Community Model License) and the NVIDIA dataset is governed by the NVIDIA Asset License Agreement found here.
Additionally, for the Meta/llama-3.1-70b-instruct model, the Llama 3.1 Community License Agreement applies; for the nvidia/llama-3.2-nv-embedqa-1b-v2 model, the Llama 3.2 Community License Agreement applies; and for the nvidia/llama-3.2-nv-rerankqa-1b-v2 model, the Llama 3.2 Community License Agreement applies. Built with Llama.