Upstream changes for v0.3.0 release (#29)
- Detailed changes are listed in the CHANGELOG.md file
shubhadeepd authored Jan 22, 2024
1 parent 5c1b121 commit 3d29acf
Showing 126 changed files with 5,459 additions and 1,033 deletions.
14 changes: 14 additions & 0 deletions .gitignore
@@ -0,0 +1,14 @@
# Python Exclusions
.venv
**__pycache__**

# Helm Exclusions
**/charts/*.tgz

# project temp files
deploy/*.log
deploy/*.txt

# Docker Compose exclusions
volumes/
uploaded_files/
39 changes: 33 additions & 6 deletions CHANGELOG.md
@@ -3,25 +3,52 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.3.0] - 2024-01-22

### Added

- [New dedicated example](./docs/rag/aiplayground.md) showcasing Nvidia AI Playground based models using Langchain connectors.
- [New example](./RetrievalAugmentedGeneration/README.md#5-qa-chatbot-with-task-decomposition-example----a100h100l40s) demonstrating query decomposition.
- Support for using [PG Vector as a vector database in the developer RAG canonical example](./RetrievalAugmentedGeneration/README.md#deploying-with-pgvector-vector-store).
- Support for using a speech-in, speech-out interface in the sample frontend leveraging RIVA Skills.
- New tool showcasing [RAG observability support](./tools/observability/).
- Support for on-prem deployment of [TRT-LLM based Nemotron models](./RetrievalAugmentedGeneration/README.md#6-qa-chatbot----nemotron-model).

### Changed

- Upgraded Langchain and LlamaIndex dependencies for all containers.
- Restructured [README](./README.md) files for improved readability.
- Added provision to plug in multiple examples using [a common base class](./RetrievalAugmentedGeneration/common/base.py); see the sketch after this list.
- Changed `minio` service's port to `9010` from `9000` in Docker-based deployment.
- Moved `evaluation` directory from top level to under `tools` and created a [dedicated compose file](./deploy/compose/docker-compose-evaluation.yaml).
- Added an [experimental directory](./experimental/) for plugging in experimental features.
- Modified notebooks to use TRT-LLM and Nvidia AI Foundation connectors from Langchain.
- Changed `ai-playground` model engine name to `nv-ai-foundation` in configurations.
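
To illustrate the plug-in mechanism, here is a minimal sketch of what such a base class might look like; the method names are hypothetical and may differ from the actual class in `RetrievalAugmentedGeneration/common/base.py`:

```python
from abc import ABC, abstractmethod
from typing import Generator

class BaseExample(ABC):
    """Hypothetical contract that each pluggable RAG example implements.

    A chain server could discover a concrete subclass in the selected
    example's directory and route its API calls to these methods.
    """

    @abstractmethod
    def ingest_docs(self, filepath: str, filename: str) -> None:
        """Parse and embed an uploaded document into the vector store."""

    @abstractmethod
    def llm_chain(self, context: str, question: str, num_tokens: int) -> Generator[str, None, None]:
        """Stream an answer from the LLM alone, without retrieval."""

    @abstractmethod
    def rag_chain(self, prompt: str, num_tokens: int) -> Generator[str, None, None]:
        """Stream a retrieval-augmented answer grounded in ingested documents."""
```

Under this scheme, a new example would only need to subclass the base and implement these methods to be served by the common chain server.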

### Fixed

- [Fixed issue #19](https://github.com/NVIDIA/GenerativeAIExamples/issues/19)


## [0.2.0] - 2023-12-15

### Added

- Support for using [Nvidia AI Foundational LLM models](./docs/rag/aiplayground.md#using-nvdia-cloud-based-llms)
- Support for using [Nvidia AI Foundational embedding models](./docs/rag/aiplayground.md#using-nvidia-cloud-based-embedding-models)
- Support for using [Nvidia AI Playground based LLM models](./docs/rag/aiplayground.md)
- Support for using [Nvidia AI Playground based embedding models](./docs/rag/aiplayground.md)
- Support for [deploying and using quantized LLM models](./docs/rag/llm_inference_server.md#quantized-llama2-model-deployment)
- Support for [evaluating RAG pipeline](./evaluation/README.md)
- Support for Kubernetes deployment using Helm charts
- Support for [evaluating RAG pipeline](./tools/evaluation/README.md)

### Changed

- Repository restructuring to allow better open source contributions
- [Upgraded dependencies](./RetrievalAugmentedGeneration/Dockerfile) for chain server container
- [Upgraded NeMo Inference Framework container version](./RetrievalAugmentedGeneration/llm-inference-server/Dockerfile), no separate sign up needed now for access.
- [Upgraded NeMo Inference Framework container version](./RetrievalAugmentedGeneration/llm-inference-server/Dockerfile), no separate sign up needed for access.
- Main [README](./README.md) now provides more details.
- Documentation improvements.
- Better error handling and reporting mechanism for corner cases.
- Renamed `triton-inference-server` container and service to `llm-inference-server`
- Better error handling and reporting mechanism for corner cases
- Renamed `triton-inference-server` container to `llm-inference-server`

### Fixed

69 changes: 48 additions & 21 deletions README.md
@@ -8,40 +8,67 @@ Generative AI Examples uses resources from the NVIDIA NGC AI Development Catalog.

Sign up for a [free NGC developer account](https://ngc.nvidia.com/signin) to access:

- The GPU-optimized NVIDIA containers, models, scripts, and tools used in these examples
- The latest NVIDIA upstream contributions to the respective programming frameworks
- The latest NVIDIA Deep Learning and LLM software libraries
- Release notes for each of the NVIDIA optimized containers
- Links to developer documentation
- GPU-optimized containers used in these examples
- Release notes and developer documentation

## Retrieval Augmented Generation (RAG)

A RAG pipeline embeds multimodal data -- such as documents, images, and video -- into a database connected to a Large Language Model. RAG lets users use an LLM to chat with their own data.
A RAG pipeline embeds multimodal data -- such as documents, images, and video -- into a database connected to an LLM. RAG lets users chat with their data!

| Name | Description | LLM | Framework | Multi-GPU | Multi-node | Embedding | TRT-LLM | Triton | VectorDB | K8s |
|---------------|-----------------------|------------|-------------------------|-----------|------------|-------------|---------|--------|----------|-----|
| [Linux developer RAG](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RetrievalAugmentedGeneration) | Single VM, single GPU | llama2-13b | Langchain + Llama Index | No | No | e5-large-v2 | Yes | Yes | Milvus | No |
| [Windows developer RAG](https://github.com/NVIDIA/trt-llm-rag-windows) | RAG on Windows | llama2-13b | Llama Index | No | No | NA | Yes | No | FAISS | NA |
| [Developer LLM Operator for Kubernetes](./docs/developer-llm-operator/) | Single node, single GPU | llama2-13b | Langchain + Llama Index | No | No | e5-large-v2 | Yes | Yes | Milvus | Yes |
### Developer RAG Examples

The developer RAG examples run on a single VM. They demonstrate how to combine NVIDIA GPU acceleration with popular LLM programming frameworks using NVIDIA's [open source connectors](#open-source-integrations). The examples are easy to deploy via [Docker Compose](https://docs.docker.com/compose/).

## Large Language Models
NVIDIA LLMs are optimized for building enterprise generative AI applications.
Examples support local and remote inference endpoints. If you have a GPU, you can run inference locally via [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). If you don't have a GPU, you can run inference and embedding remotely via [NVIDIA AI Foundations endpoints](https://www.nvidia.com/en-us/ai-data-science/foundation-models/).
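
As a brief illustration, remote inference with the Langchain connector might look like the following; this is a minimal sketch, assuming the `langchain-nvidia-ai-endpoints` package is installed, an `NVIDIA_API_KEY` environment variable is set, and using model names from the table below:

```python
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

# Remote chat completion against an NVIDIA AI Foundation endpoint.
llm = ChatNVIDIA(model="mixtral_8x7b")
print(llm.invoke("What is retrieval augmented generation?").content)

# Remote embeddings for populating the vector database.
embedder = NVIDIAEmbeddings(model="nvolveqa_40k")
vectors = embedder.embed_documents(["RAG connects an LLM to your own data."])
```

Local inference works the same way through the TensorRT-LLM connectors listed under [Open Source Integrations](#open-source-integrations).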

| Name | Description | Type | Context Length | Example | License |
|---------------|-----------------------|------------|----------------|---------|---------|
| [nemotron-3-8b-qa-4k](https://huggingface.co/nvidia/nemotron-3-8b-qa-4k) | Q&A LLM customized on knowledge bases | Text Generation | 4096 | No | [NVIDIA AI Foundation Models Community License Agreement](https://developer.nvidia.com/downloads/nv-ai-foundation-models-license) |
| [nemotron-3-8b-chat-4k-steerlm](https://huggingface.co/nvidia/nemotron-3-8b-chat-4k-steerlm) | Best out-of-the-box chat model with flexible alignment at inference | Text Generation | 4096 | No | [NVIDIA AI Foundation Models Community License Agreement](https://developer.nvidia.com/downloads/nv-ai-foundation-models-license) |
| [nemotron-3-8b-chat-4k-rlhf](https://huggingface.co/nvidia/nemotron-3-8b-chat-4k-rlhf) | Best out-of-the-box chat model performance| Text Generation | 4096 | No | [NVIDIA AI Foundation Models Community License Agreement](https://developer.nvidia.com/downloads/nv-ai-foundation-models-license) |
| Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
|---------------|-----------------------|------------|-------------------------|-----------|------------|-------------|---------|--------|
| llama-2 | e5-large-v2 | Llamaindex | Canonical QA Chatbot | [YES](RetrievalAugmentedGeneration/README.md#3-qa-chatbot-multi-gpu----a100h100l40s) | [YES](RetrievalAugmentedGeneration/README.md#2-qa-chatbot----a100h100l40s-gpu) | NO | YES | Milvus/[PGVector](RetrievalAugmentedGeneration/README.md#2-qa-chatbot----a100h100l40s-gpu) |
| mixtral_8x7b | nvolveqa_40k | Langchain | [Nvidia AI foundation based QA Chatbot](RetrievalAugmentedGeneration/README.md#1-qa-chatbot----nvidia-ai-foundation-inference-endpoint) | NO | NO | YES | YES | FAISS |
| llama-2 | all-MiniLM-L6-v2 | Llama Index | [QA Chatbot, GeForce, Windows](https://github.com/NVIDIA/trt-llm-rag-windows/tree/release/1.0) | NO | YES | NO | NO | FAISS |
| llama-2 | nvolveqa_40k | Langchain | [QA Chatbot, Task Decomposition Agent](./RetrievalAugmentedGeneration/README.md#5-qa-chatbot-with-task-decomposition-example----a100h100l40s) | NO | NO | YES | YES | FAISS |
| mixtral_8x7b | nvolveqa_40k | Langchain | [Minimalistic example showcasing RAG using Nvidia AI foundation models](./examples/README.md#rag-in-5-minutes-example) | NO | NO | YES | YES | FAISS |


## Integration Examples

### Enterprise RAG Examples

The enterprise RAG examples run as microservices distributed across multiple VMs and GPUs. They show how RAG pipelines can be orchestrated with [Kubernetes](https://kubernetes.io/) and deployed with [Helm](https://helm.sh/).

Enterprise RAG examples include a [Kubernetes operator](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) for LLM lifecycle management. It is compatible with the [NVIDIA GPU operator](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/gpu-operator) that automates GPU discovery and lifecycle management in a Kubernetes cluster.

Enterprise RAG examples also support local and remote inference via [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [NVIDIA AI Foundations endpoints](https://www.nvidia.com/en-us/ai-data-science/foundation-models/).

| Model | Embedding | Framework | Description | Multi-GPU | Multi-node | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
|---------------|-----------------------|------------|--------|-------------------------|-----------|------------|-------------|---------|--------|
| llama-2 | NV-Embed-QA-003 | Llamaindex | QA Chatbot, Helm, k8s | NO | NO | [YES](./docs/developer-llm-operator/) | NO | YES | Milvus|

## Tools

Example tools and tutorials to enhance LLM development and productivity when using NVIDIA RAG pipelines.

| Name | Description | Deployment | Tutorial |
|------|-------------|------|--------|
| Evaluation | Example open source RAG eval tool that uses synthetic data generation and LLM-as-a-judge | [Docker compose file](./deploy/compose/docker-compose-evaluation.yaml) | [README](./docs/rag/evaluation.md) |
| Observability | Monitoring and debugging for RAG pipelines | [Docker compose file](./deploy/compose/docker-compose-observability.yaml) | [README](./docs/rag/observability.md) |

## Open Source Integrations

These open source connectors for NVIDIA-hosted and self-hosted API endpoints are maintained and tested by NVIDIA engineers.

| Name | Framework | Chat | Text Embedding | Python | Description |
|------|-----------|------|-----------|--------|-------------|
|[NVIDIA AI Foundation Endpoints](https://python.langchain.com/docs/integrations/providers/nvidia) | [Langchain](https://www.langchain.com/) |[YES](https://python.langchain.com/docs/integrations/chat/nvidia_ai_endpoints)|[YES](https://python.langchain.com/docs/integrations/text_embedding/nvidia_ai_endpoints)|[YES](https://pypi.org/project/langchain-nvidia-ai-endpoints/)|Easy access to NVIDIA hosted models. Supports chat, embedding, code generation, steerLM, multimodal, and RAG.|
|[NVIDIA Triton + TensorRT-LLM](https://github.com/langchain-ai/langchain/tree/master/libs/partners/nvidia-trt) | [Langchain](https://www.langchain.com/) |[YES](https://github.com/langchain-ai/langchain/blob/master/libs/partners/nvidia-trt/docs/llms.ipynb)|[YES](https://github.com/langchain-ai/langchain/blob/master/libs/partners/nvidia-trt/docs/llms.ipynb)|[YES](https://pypi.org/project/langchain-nvidia-trt/)|This connector allows Langchain to remotely interact with a Triton inference server over gRPC or HTTP for optimized LLM inference (see the sketch after this table).|
|[NVIDIA Triton Inference Server](https://docs.llamaindex.ai/en/stable/examples/llm/nvidia_triton.html) | [LlamaIndex](https://www.llamaindex.ai/) |YES|YES|NO|Triton inference server provides API access to hosted LLM models over gRPC. |
|[NVIDIA TensorRT-LLM](https://docs.llamaindex.ai/en/stable/examples/llm/nvidia_tensorrt.html) | [LlamaIndex](https://www.llamaindex.ai/) |YES|YES|NO|TensorRT-LLM provides a Python API to build TensorRT engines with state-of-the-art optimizations for LLM inference on NVIDIA GPUs. |
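
As a quick usage sketch of the Triton connector above, assuming a Triton server is already running and serving a TRT-LLM model named `ensemble` (both the endpoint and the model name here are illustrative):

```python
# Assumes the langchain-nvidia-trt package is installed.
from langchain_nvidia_trt.llms import TritonTensorRTLLM

# Connect to a remote Triton inference server over gRPC (hypothetical endpoint).
llm = TritonTensorRTLLM(server_url="localhost:8001", model_name="ensemble")
print(llm.invoke("Summarize retrieval augmented generation in one sentence."))
```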


## NVIDIA support
In each of the READMEs, we indicate the level of support provided.
In each example README, we indicate the level of support provided.

## Feedback / Contributions
We're posting these examples on GitHub to better support the community, facilitate feedback, as well as collect and implement contributions using GitHub Issues and pull requests. We welcome all contributions!
We're posting these examples on GitHub to support the NVIDIA LLM community and facilitate feedback. We invite contributions via GitHub Issues or pull requests!

## Known issues
- In each of the READMEs, we indicate any known issues and encourage the community to provide feedback.
1 change: 0 additions & 1 deletion RetrievalAugmentedGeneration/.gitattributes

This file was deleted.

25 changes: 0 additions & 25 deletions RetrievalAugmentedGeneration/.gitignore

This file was deleted.

12 changes: 10 additions & 2 deletions RetrievalAugmentedGeneration/Dockerfile
@@ -1,14 +1,22 @@
ARG BASE_IMAGE_URL=nvcr.io/nvidia/pytorch
ARG BASE_IMAGE_TAG=23.08-py3


FROM ${BASE_IMAGE_URL}:${BASE_IMAGE_TAG}

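# EXAMPLE_NAME selects which example's code (and optional requirements.txt) is baked
# into the image, e.g. `docker build --build-arg EXAMPLE_NAME=<name> .` where <name>
# is a directory under RetrievalAugmentedGeneration/examples/.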
ARG EXAMPLE_NAME
COPY RetrievalAugmentedGeneration/__init__.py /opt/RetrievalAugmentedGeneration/
COPY RetrievalAugmentedGeneration/common /opt/RetrievalAugmentedGeneration/common
COPY RetrievalAugmentedGeneration/examples /opt/RetrievalAugmentedGeneration/examples
COPY RetrievalAugmentedGeneration/examples/${EXAMPLE_NAME} /opt/RetrievalAugmentedGeneration/example
COPY integrations /opt/integrations
COPY tools /opt/tools
RUN apt-get update && apt-get install -y libpq-dev
RUN --mount=type=bind,source=RetrievalAugmentedGeneration/requirements.txt,target=/opt/requirements.txt \
python3 -m pip install --no-cache-dir -r /opt/requirements.txt

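# Install example-specific dependencies only when the selected example ships its own requirements.txt.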
RUN if [ -f "/opt/RetrievalAugmentedGeneration/example/requirements.txt" ] ; then \
python3 -m pip install --no-cache-dir -r /opt/RetrievalAugmentedGeneration/example/requirements.txt ; else \
echo "Skipping example dependency installation, since requirements.txt was not found" ; \
fi

WORKDIR /opt
ENTRYPOINT ["uvicorn", "RetrievalAugmentedGeneration.common.server:app"]
