Built for the conference talk I held live for the Engineering division @ Endava.
The goal was to give an intro to Embeddings, Tensors, Quantization, RAG, and other common ML concepts.
Presentation link
# build
cargo b
# examples -- running inference on different models
# run jina-bert in release (GPU not supported: no cuda implementation for softmax-last-dim)
cargo run --release --bin jina-bert -- --cpu --prompt "The best thing about coding in rust is "
# run quantized models (usually CPU only)
# by default the '7b-mistral-instruct-v0.2' weights get downloaded & loaded
cargo run --release --bin quantized -- --cpu --prompt "The best thing about coding in rust is "
# to run a specific model, see /quantized/src/main.rs (enum Which) for the supported models
cargo run --release --bin quantized -- --which mixtral --prompt "The best thing about coding in rust is "
# mistral (I wasn't able to run it on 8 GB of dedicated VRAM --> weights > 8 GB)
cargo run --release -- --prompt 'Write helloworld code in Rust' --sample-len 150
# gemma models
# Access preconditions: https://github.com/huggingface/candle/tree/main/candle-examples/examples/gemma
# HF auth CLI: https://huggingface.co/docs/huggingface_hub/guides/cli#getting-started
# 1 pip install -U "huggingface_hub[cli]"
# 2 huggingface-cli login
# 3 copy the token from the URL shown in the CLI and paste it at the prompt
cargo run --bin gemma --release -- --which code-7b-it --prompt "fn count_primes(max_n: usize)"
# On NVIDIA 2000-series GPUs and older:
# Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_u32_bf16
# See: https://github.com/huggingface/candle/issues/1911 (only supported on RTX 3000+ GPUs with compute capability >= 8.0)
# 1 clean the build artifacts
rm -rf target
# 2 update dependencies
cargo update
# 3 rebuild
cargo build
To resolve CUDA runtime issues, see: Error: Cuda("no cuda implementation for softmax-last-dim") #1330
# 1 Add the "cuda" feature to your candle-transformers dependency (same as for candle-core)
candle-transformers = { git = "https://github.com/huggingface/candle.git", version = "0.4.2", features = ["cuda"] }
# 2 Run the model as normal
cargo run --release --bin quantized -- --prompt "The best thing about coding in rust is "
A Tensor is made of two parts (see the sketch below):
- a 1D block of memory called Storage that holds the raw data, and
- a View over that storage that holds its shape. PyTorch Internals could be helpful here.
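A minimal sketch of that idea using candle-core (the crate this repo already depends on): reshaping a contiguous tensor only creates a new view (shape) over the same underlying 1D storage, it does not copy the data.
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    // 12 contiguous f32 values in one 1D block of storage
    let storage = Tensor::arange(0f32, 12f32, &Device::Cpu)?;
    println!("{:?}", storage.dims()); // [12]

    // a different view over the same data: 3 rows x 4 columns
    let view = storage.reshape((3, 4))?;
    println!("{:?}", view.dims()); // [3, 4]

    // and yet another view: 2 x 2 x 3
    let view2 = storage.reshape((2, 2, 3))?;
    println!("{:?}", view2.dims()); // [2, 2, 3]
    Ok(())
}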
Weights for Quantized mistral [TheBloke]
Even quantized models are slow when run on a laptop CPU
(365 tokens generated: 1.23 token/s). For RAG it seems like running 7B models locally is doable even when quantized.
When you enable the 'cuda' feature for candle-transformers you get 421 tokens generated: 37.81 token/s (quantized model).
Weight quantization in large language models (LLMs) or any deep learning models refers to the process of reducing the precision of the model's weights from floating-point representation (e.g., 32-bit floating-point numbers) to lower bit-width representations (e.g., 16-bit, 8-bit, or even 1-bit). The primary goal of weight quantization is to reduce the memory footprint and computational requirements of the model, allowing for faster and more efficient inference on devices with limited resources, such as mobile devices or embedded systems.
Weight quantization typically involves the following steps (a toy sketch follows the list):
- Weight quantization: The model's weights are quantized from higher precision floating-point representations to lower bit-width fixed-point representations. This step usually involves finding a suitable scaling factor to maintain the dynamic range of the weights.
- Quantization error compensation: This step aims to minimize the loss in accuracy caused by the quantization process. One common approach is to use a technique called "post-training quantization," where the quantized model is fine-tuned to compensate for the quantization error.
- Rounding and clipping: The quantized weights are rounded to the nearest representable value within the target bit-width, potentially introducing some clipping errors in the process.
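The scaling, rounding, and clipping steps above in plain Rust, as a toy symmetric per-tensor int8 scheme. This is for illustration only; it is not the exact scheme used by the quantized GGUF-style weights referenced below.
// toy symmetric int8 weight quantization: scale, round, clip, dequantize
fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    // scaling factor chosen so the largest |weight| maps to 127
    let max_abs = weights.iter().fold(0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = weights
        .iter()
        // round to the nearest representable value and clip to the int8 range
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

fn dequantize_i8(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|q| *q as f32 * scale).collect()
}

fn main() {
    let weights = [0.42_f32, -1.37, 0.003, 2.10, -0.88];
    let (q, scale) = quantize_i8(&weights);
    let restored = dequantize_i8(&q, scale);
    // the restored weights differ slightly from the originals: that is the quantization error
    println!("scale = {scale:.5}\nq = {q:?}\nrestored = {restored:?}");
}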
Good RAG posts
- Efficient Information Retrieval with RAG Workflow
- RAG chatbot using qdrant + Gemini
- Implementing RAG w HF + Langchain
Example Python implementation (embedding text with DistilBERT)
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda"  # the device to load the model onto

# fp16 weights + FlashAttention-2 for faster inference
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert/distilbert-base-uncased", torch_dtype=torch.float16, attn_implementation="flash_attention_2")
model.to(device)

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt").to(device)
output = model(**encoded_input)  # output.last_hidden_state holds the token embeddings
Downloaded model weights land in the Hugging Face hub cache, e.g.
C:\Users\dpolzer\.cache\huggingface\hub
Check your GPU VRAM
nvidia-smi
# or watch (loops every few sec)
nvidia-smi -l
# more details
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
#(example) NVIDIA GeForce RTX 2070 with Max-Q Design, 7.5, 552.12
- Non-matching document chunk dimensions.
If you split the tokenized document into chunks of unequal (non-padded) length, your matrix dimensions won't match, and when you try to stack them you'll get an error similar to:
ERROR: shape mismatch in cat for dim 1, shape for arg 1: [1, 84, 768] shape for arg 2: [1, 100, 768]
let stacked_embeddings = Tensor::stack(&embeddings_arc, 0)?;
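A minimal candle-core sketch of the failure and one way around it. The tensor names and sizes mirror the error above but are made up for the example; padding at tokenization time is the cleaner fix.
use candle_core::{DType, Device, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    // two chunk embeddings with different token counts: 84 vs 100
    let chunk_a = Tensor::zeros((1, 84, 768), DType::F32, &device)?;
    let chunk_b = Tensor::zeros((1, 100, 768), DType::F32, &device)?;

    // stacking them directly fails with the shape-mismatch error above
    assert!(Tensor::stack(&[&chunk_a, &chunk_b], 0).is_err());

    // pad the shorter chunk along the token dimension so both are [1, 100, 768]
    let chunk_a = chunk_a.pad_with_zeros(1, 0, 100 - 84)?;
    let stacked = Tensor::stack(&[&chunk_a, &chunk_b], 0)?;
    assert_eq!(stacked.dims(), &[2, 1, 100, 768]);
    Ok(())
}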
- Errors thrown from Qdrant are usually because it expects a one-dimensional Tensor (a flat vector) for each point.
For example, the document chunk embeddings shape should look like
Tensor SHAPE/Dims: [12, 768]
--> 12 chunks / embeddings of 768 dimensions (generated from the DistilBERT model)
Prompt Shape/Dims: [768] --> (generated from the DistilBERT model)
// you can either call .squeeze(0)? to drop a single unwanted dimension,
// or average away multiple unwanted dimensions and keep one relevant embedding per chunk:
.mean((0, 1))?; // was [1, 9, 768] -> keep [768]
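A small candle-core sketch of that pooling step, plus flattening to the plain Vec<f32> a Qdrant client ultimately wants (shapes and variable names here are illustrative):
use candle_core::{DType, Device, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    // stand-in for a model output of shape [1, 9, 768]: (batch, tokens, hidden)
    let output = Tensor::zeros((1, 9, 768), DType::F32, &device)?;

    // mean-pool over the batch and token dims -> a single 768-dim embedding
    let embedding = output.mean((0, 1))?;
    assert_eq!(embedding.dims(), &[768]);

    // Qdrant expects a flat list of floats per point
    let vector: Vec<f32> = embedding.to_vec1()?;
    println!("{} dims", vector.len());
    Ok(())
}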
See also: Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_f32_bf16 #2041