Built for the conference talk I held live for the Engineering division @ Endava.
The goal was to give an intro to Embeddings, Tensors, Quantization, RAG, and other common ML concepts.
Presentation link
# build
cargo b
# examples -- running inference on different models
# run jina-bert in release (GPU not supported: no cuda implementation for softmax-last-dim)
cargo run --release --bin jina-bert -- --cpu --prompt "The best thing about coding in rust is "
# run quantized models (usually CPU only)
# by default the '7b-mistral-instruct-v0.2' weights get downloaded & loaded
cargo run --release --bin quantized -- --cpu --prompt "The best thing about coding in rust is "
# to run a specific model, see /quantized/src/main.rs (enum Which) for the supported models
cargo run --release --bin quantized -- --which mixtral --prompt "The best thing about coding in rust is "
# mistral (I wasn't able to run it on 8 GB of dedicated VRAM --> weights > 8 GB)
cargo run --release -- --prompt 'Write helloworld code in Rust' --sample-len 150
# gemma models
# Access preconditions: https://github.com/huggingface/candle/tree/main/candle-examples/examples/gemma
# HF auth CLI: https://huggingface.co/docs/huggingface_hub/guides/cli#getting-started
# 1 pip install -U "huggingface_hub[cli]"
# 2 huggingface-cli login
# 3 copy the token from the URL shown in the CLI and paste it at the prompt
cargo run --bin gemma --release -- --which code-7b-it --prompt "fn count_primes(max_n: usize)"
# On NVIDIA 2000-series GPUs and older:
# Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_u32_bf16
# See: https://github.com/huggingface/candle/issues/1911 (only supported on RTX 3000+ GPUs with compute capability >= 8.0)
# 1 clean the build artifacts
rm -rf target
# 2 update dependencies
cargo update
# 3 rebuild
cargo build
To resolve CUDA runtime issues, see: Error: Cuda("no cuda implementation for softmax-last-dim") #1330
# 1 Add the "cuda" feature to your candle-transformers dependency (same as for candle-core)
candle-transformers = { git = "https://github.com/huggingface/candle.git", version = "0.4.2", features = ["cuda"] }
# 2 Run the model as normal
cargo run --release --bin quantized -- --prompt "The best thing about coding in rust is "
A Tensor is made of two parts (see the sketch below):
- a 1D block of memory called Storage that holds the raw data, and
- a View over that storage that holds its shape. PyTorch Internals could be helpful here.
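A minimal sketch of that idea using candle-core (the crate this repo already depends on): reshaping a contiguous tensor only creates a new view (shape) over the same underlying 1D storage, it does not copy the data.
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    // 12 contiguous f32 values in one 1D block of storage
    let storage = Tensor::arange(0f32, 12f32, &Device::Cpu)?;
    println!("{:?}", storage.dims()); // [12]

    // a different view over the same data: 3 rows x 4 columns
    let view = storage.reshape((3, 4))?;
    println!("{:?}", view.dims()); // [3, 4]

    // and yet another view: 2 x 2 x 3
    let view2 = storage.reshape((2, 2, 3))?;
    println!("{:?}", view2.dims()); // [2, 2, 3]
    Ok(())
}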
Weights for Quantized mistral [TheBloke]
Even quantized models are slow when run on a laptop CPU
(365 tokens generated: 1.23 token/s). For RAG it seems like running 7B models locally is doable even when quantized.
When you enable the 'cuda' feature for candle-transformers you get 421 tokens generated: 37.81 token/s (quantized model).
Weight quantization in large language models (LLMs) or any deep learning models refers to the process of reducing the precision of the model's weights from floating-point representation (e.g., 32-bit floating-point numbers) to lower bit-width representations (e.g., 16-bit, 8-bit, or even 1-bit). The primary goal of weight quantization is to reduce the memory footprint and computational requirements of the model, allowing for faster and more efficient inference on devices with limited resources, such as mobile devices or embedded systems.
Weight quantization typically involves the following steps (a toy sketch follows the list):
- Weight quantization: The model's weights are quantized from higher precision floating-point representations to lower bit-width fixed-point representations. This step usually involves finding a suitable scaling factor to maintain the dynamic range of the weights.
- Quantization error compensation: This step aims to minimize the loss in accuracy caused by the quantization process. One common approach is to use a technique called "post-training quantization," where the quantized model is fine-tuned to compensate for the quantization error.
- Rounding and clipping: The quantized weights are rounded to the nearest representable value within the target bit-width, potentially introducing some clipping errors in the process.
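The scaling, rounding, and clipping steps above in plain Rust, as a toy symmetric per-tensor int8 scheme. This is for illustration only; it is not the exact scheme used by the quantized GGUF-style weights referenced below.
// toy symmetric int8 weight quantization: scale, round, clip, dequantize
fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    // scaling factor chosen so the largest |weight| maps to 127
    let max_abs = weights.iter().fold(0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = weights
        .iter()
        // round to the nearest representable value and clip to the int8 range
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

fn dequantize_i8(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|q| *q as f32 * scale).collect()
}

fn main() {
    let weights = [0.42_f32, -1.37, 0.003, 2.10, -0.88];
    let (q, scale) = quantize_i8(&weights);
    let restored = dequantize_i8(&q, scale);
    // the restored weights differ slightly from the originals: that is the quantization error
    println!("scale = {scale:.5}\nq = {q:?}\nrestored = {restored:?}");
}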
Good RAG posts
- Efficient Information Retrieval with RAG Workflow
- RAG chatbot using qdrant + Gemini
- Implementing RAG w HF + Langchain
Example Python implementation (embedding text with DistilBERT)
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda"  # the device to load the model onto

# fp16 weights + FlashAttention-2 for faster inference
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert/distilbert-base-uncased", torch_dtype=torch.float16, attn_implementation="flash_attention_2")
model.to(device)

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt").to(device)
output = model(**encoded_input)  # output.last_hidden_state holds the token embeddings
Downloaded model weights land in the Hugging Face hub cache, e.g.
C:\Users\dpolzer\.cache\huggingface\hub
Check your GPU VRAM
nvidia-smi
# or watch (loops every few sec)
nvidia-smi -l
# more details
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
#(example) NVIDIA GeForce RTX 2070 with Max-Q Design, 7.5, 552.12
- Non-matching document chunk dimensions.
If you split the tokenized document into chunks of unequal (non-padded) length, your matrix dimensions won't match, and when you try to stack them you'll get an error similar to:
ERROR: shape mismatch in cat for dim 1, shape for arg 1: [1, 84, 768] shape for arg 2: [1, 100, 768]
let stacked_embeddings = Tensor::stack(&embeddings_arc, 0)?;
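A minimal candle-core sketch of the failure and one way around it. The tensor names and sizes mirror the error above but are made up for the example; padding at tokenization time is the cleaner fix.
use candle_core::{DType, Device, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    // two chunk embeddings with different token counts: 84 vs 100
    let chunk_a = Tensor::zeros((1, 84, 768), DType::F32, &device)?;
    let chunk_b = Tensor::zeros((1, 100, 768), DType::F32, &device)?;

    // stacking them directly fails with the shape-mismatch error above
    assert!(Tensor::stack(&[&chunk_a, &chunk_b], 0).is_err());

    // pad the shorter chunk along the token dimension so both are [1, 100, 768]
    let chunk_a = chunk_a.pad_with_zeros(1, 0, 100 - 84)?;
    let stacked = Tensor::stack(&[&chunk_a, &chunk_b], 0)?;
    assert_eq!(stacked.dims(), &[2, 1, 100, 768]);
    Ok(())
}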
- Errors thrown from Qdrant are usually because it expects a one-dimensional Tensor (a flat vector) for each point.
For example, the document chunk embeddings shape should look like
Tensor SHAPE/Dims: [12, 768]
--> 12 chunks / embeddings of 768 dimensions (generated from the DistilBERT model)
Prompt Shape/Dims: [768] --> (generated from the DistilBERT model)
// you can either call .squeeze(0)? to drop a single unwanted dimension,
// or average away multiple unwanted dimensions and keep one relevant embedding per chunk:
.mean((0, 1))?; // was [1, 9, 768] -> keep [768]
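A small candle-core sketch of that pooling step, plus flattening to the plain Vec<f32> a Qdrant client ultimately wants (shapes and variable names here are illustrative):
use candle_core::{DType, Device, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    // stand-in for a model output of shape [1, 9, 768]: (batch, tokens, hidden)
    let output = Tensor::zeros((1, 9, 768), DType::F32, &device)?;

    // mean-pool over the batch and token dims -> a single 768-dim embedding
    let embedding = output.mean((0, 1))?;
    assert_eq!(embedding.dims(), &[768]);

    // Qdrant expects a flat list of floats per point
    let vector: Vec<f32> = embedding.to_vec1()?;
    println!("{} dims", vector.len());
    Ok(())
}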
See also: Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_f32_bf16 #2041