- Major Features
- INT8 KV cache.
- Varlen flash decoding.
- Major Features
- Fix MiniCPM loader
- FlashDecoding depends on the standard flash-attn shared object (.so).
- Major Features
- Refactor and delete unused code.
- Migrate the OpenAI-compatible server to zhilight.
- Major Features
- Set HOST_REDUCE=1 for Qwen2 14B.
- Major Features
- Fix bias handling in linear.cpp for Qwen2 14B.
- Use TensorRT for Int8Linear.
- Major Features
- GPTQ supports the Marlin kernel.
- Chunked prefill.
- Major Features
- Dynamic batch supports PRE_ALLOC_ALL_TOKEN (fixes OOM).
- Reserve more memory for DUAL_STREAM=1 (controlled by NEED_DEQUANT_WEIGHT).
- Enlarge BLOCK_M for gemm_warp_reduce (faster W4A16 decoding on A800).
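The switches above are environment variables. A minimal sketch of setting them from Python, assuming they are plain process-environment flags read once at engine startup (the variable names come from the entries above; everything else is illustrative):

```python
import os

# Assumption: these switches are read from the process environment at engine startup.
os.environ["PRE_ALLOC_ALL_TOKEN"] = "1"   # pre-allocate token memory for dynamic batch (avoids OOM)
os.environ["DUAL_STREAM"] = "1"           # dual-stream encode
os.environ["NEED_DEQUANT_WEIGHT"] = "1"   # reserve extra memory for dequantized weights under DUAL_STREAM=1
```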
- Major Features
- Support CommandR+ model.
- Bug Fix
- Fix desc_act for TP.
- Major Features
- Support DeepSeekV2 model.
- Bug Fix
- Fix ChatMLTokenizer adaptation.
- Major Features
- W4A16 supports the TensorRT kernel: GPTQ_KERNEL_ALGO=2.
- W4A8 INT8 supports the TensorRT kernel: W4_INT8_ALGO=2.
- Optimize CPM_FUSE_FF_IN=2.
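A short sketch of selecting the TensorRT-backed kernels via the variables named above; the names and values are taken from these entries, while treating them as environment switches that must be set before the model is loaded is an assumption:

```python
import os

# Assumption: kernel selection is read from the environment before the model is loaded.
os.environ["GPTQ_KERNEL_ALGO"] = "2"   # W4A16: use the TensorRT kernel
os.environ["W4_INT8_ALGO"] = "2"       # W4A8 INT8: use the TensorRT kernel
os.environ["CPM_FUSE_FF_IN"] = "2"     # optimized feed-forward input fusion mode
```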
- Bug Fix
- Fix streaming decode.
- Bug Fix
- Fix crash when py_task is destroyed.
- Major Features
- Dual-stream encoding (20%+ speedup on 4090pro).
- Bug Fix
- Fix flash_attention when the real batch size > 1.
- Major Features
- Fuse gate activation with multiplication in FeedForward.
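For context, a minimal sketch of the gated feed-forward that this fusion targets: the gate activation and the element-wise multiplication are the two steps combined into a single kernel pass. The SiLU activation, the weight names, and which projection is gated are assumptions, not zhilight's exact layout.

```python
import torch
import torch.nn.functional as F

def gated_feedforward(x: torch.Tensor, w_in: torch.Tensor,
                      w_gated: torch.Tensor, w_out: torch.Tensor) -> torch.Tensor:
    # Unfused reference: activate the gate projection, then multiply element-wise
    # with the input projection; the fused kernel merges these two steps.
    return (F.silu(x @ w_gated) * (x @ w_in)) @ w_out
```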
- Bug Fix
- Fix bf16 GEMM when HIGH_PRECISION=0.
- Change SiLU and GELU from double to single precision (double is very slow).
- Bug Fix for 0.3.45
- Add 89 to CMAKE_CUDA_ARCHITECTURES to build the FP8 kernel.
- Major Features
- W8A8 FP8.
- Update ModelContext::reduce_tp_int8.
- Major Features
- W4A8 fp8 v1.
- Optimize W4A8 int8 v1.
- Major Features
- W4A8 int8 v1.
- Major Features
- Group-wise INT8 quantized reduce_sum for tensor parallelism.
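A library-agnostic sketch of the group-wise INT8 quantize/dequantize pair that can wrap a tensor-parallel reduce_sum to shrink communication volume; the group size and the exact placement around the all-reduce are assumptions, not zhilight's kernel.

```python
import torch

def groupwise_int8_pack(x: torch.Tensor, group_size: int = 128):
    """Quantize a flat tensor to INT8 per group; numel must divide by group_size (assumption)."""
    groups = x.view(-1, group_size)
    scales = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(groups / scales).to(torch.int8)
    return q, scales

def groupwise_int8_unpack(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Dequantize after the reduce and restore the original flat shape."""
    return (q.float() * scales).view(-1)
```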
- Bug Fix
- Fix the new GPTQ kernel for dynamic batching.
- Bug Fix
- Fix loading CPMBee.
- Major Features
- Fuse GPTQ MOE when batch_size=1.
- Update MOE: change the router's out_type to float; use cudaMemcpy.
- GPTQ kernel supports the HIGH_PRECISION env variable.
- Optimize gemm_fuse_gate_in: preload matrix A into shared memory; use the lop3 instruction.
- Bug Fix
- Remove automatic work memory setting.
- Major Features
- Support Qwen models.
- Optimize the new GPTQ kernel with symmetric quantization.
- Major Features
- Update the new GPTQ kernel:
  - Support desc_act.
  - Support CPM_FUSE_FF_IN=2 and CPM_FUSE_QKV=2: fuse only for small m.
- Features
- Dynamic batch returns top_logprobs.
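For context, a minimal, library-independent sketch of deriving per-step top log-probabilities from raw logits; this is not zhilight's implementation, and the [batch, vocab] shape is an assumption.

```python
import torch

def top_logprobs(logits: torch.Tensor, k: int = 5):
    """Return the top-k log-probabilities and their token ids for one decode step."""
    logprobs = torch.log_softmax(logits.float(), dim=-1)   # logits: [batch, vocab] (assumed)
    values, token_ids = torch.topk(logprobs, k, dim=-1)
    return values, token_ids
```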
- Major Features
- Add new GPTQ kernel.
- Other Changes
- Disable weight fusion for GPTQ models.
- Add the block index to the prefix cache key.
- Bug Fix
- Fix config parsing when rope_scaling=None.
- Bug Fix
- Fix chat model eos_id.
- Major Features
- Support minicpm_moe.
- Major Features
- Support dynamic NTK-aware context length extension.
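For reference, the commonly used dynamic NTK-aware rescaling of the RoPE base, written out as a small helper; whether zhilight applies exactly this formula with these parameter names is an assumption.

```python
def dynamic_ntk_base(base: float, dim: int, seq_len: int,
                     max_position_embeddings: int, alpha: float = 1.0) -> float:
    """Rescale the RoPE base once the sequence exceeds the trained context length."""
    if seq_len <= max_position_embeddings:
        return base
    scale = (alpha * seq_len / max_position_embeddings) - (alpha - 1)
    return base * scale ** (dim / (dim - 2))
```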
- Bug Fix
- Fix for cancel: invalid reference when the Python task object is destructed.
- Other Changes
- Loader supports AWQ models.
- Bug Fix
- Fix long-context crash for inputs up to 128k (no RoPE scaling yet).
- Bug Fix
- Fix how Dragonfly applies residual scaling by depth.
- Major Features
- Add AWQ kernel.
- Bug Fix
- Fix attention for long input.
- Bug Fix
- Fix type casting of the passed-in hidden_states.
- Other Changes
- Improve paged KV cache page reuse; separate physical and logical page lifecycle management.
- Fix the softmax kernel for long inputs.
- Major Features
- Copy-on-write for paged attention: identical outputs from parallel sampling are stored in the same physical blocks.
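A minimal, library-agnostic sketch of the copy-on-write idea: physical blocks are reference-counted, parallel samples share them while the contents stay identical, and a block is copied only when a sequence is about to write into a shared block. All names and structures here are illustrative, not zhilight's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    data: list = field(default_factory=list)  # stands in for one page of KV-cache entries
    ref_count: int = 1

class BlockTable:
    """Per-sequence logical-to-physical block mapping with copy-on-write."""

    def __init__(self, blocks=None):
        self.blocks = blocks or []

    def fork(self) -> "BlockTable":
        # A parallel sample starts by sharing every physical block with its parent.
        for block in self.blocks:
            block.ref_count += 1
        return BlockTable(list(self.blocks))

    def append_token(self, idx: int, token_kv):
        block = self.blocks[idx]
        if block.ref_count > 1:
            # Copy-on-write: detach from the shared block before mutating it.
            block.ref_count -= 1
            block = Block(data=list(block.data))
            self.blocks[idx] = block
        block.data.append(token_kv)
```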
- Major Features
- Support Paged Attention.
- Support parallel sampling (num_results > 1) with a shared-prefix KV cache.
- Major Features
- Support AWQ models with the exllama kernel.
- Other Changes
- The exllama kernel uses a float accumulator.
- Bug Fix
- Dynamic batch with flash attention: set len_k to full_input_len during encoding.
- Support presence_penalty (see the sketch after this list).
- Fix prompt cache with flash attention.
- Fix streaming decode for the HF tokenizer.
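The presence_penalty sketch referenced above, in the standard formulation where every token that already appears in the output has a flat penalty subtracted from its logit; the 1-D per-sequence logits shape is an assumption, and zhilight's exact variant is not shown.

```python
import torch

def apply_presence_penalty(logits: torch.Tensor, generated_ids: list[int],
                           presence_penalty: float) -> torch.Tensor:
    """Subtract a flat penalty from the logit of every token already generated."""
    logits = logits.clone()                 # logits: [vocab] for one sequence (assumed)
    for token_id in set(generated_ids):
        logits[token_id] -= presence_penalty
    return logits
```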
- Major Features
- Support prompt cache.
- Flash decoding layer supports paged attention.
- Config Changes
- GeneratorArg: beam_size now defaults to 1; the random search seed defaults to 42.
- Add QuantType GPTQ.
- Speed Optimizations
- Use FlashAttention to copy the buffer; first-token latency is reduced by roughly 10%.
- Speed Optimizations
- Split KV Attention during decoding.
- New Features
- Dynamic batch supports chat-format input.
- Major Features
- Support GPTQ INT4 models.
- Support CUDA 12.
- Major Features
- Optimize MOE; MOE supports INT8.
- Optimize loading of distributed parameters.
- Bug fix for ChatML; use CHATML_COMPATIBLE=0 to disable prepending <s>.
- Other Changes
- FlashDecoding now compiles against FlashAttention 2.5 (with paged attention support).
- Reduce libflash_decoding.a size to 255 MB.
- Major Features
- Support MOE models: mixtral and cpmd_moe.
- Add the ModelLoader.convert_pt_to_safetensors() utility method.
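A hedged usage sketch of the new utility; the method name comes from this entry, while the import path and the single checkpoint-directory argument are assumptions and may differ from the real signature.

```python
# Assumption: ModelLoader is importable from the zhilight package and the
# utility takes a checkpoint directory; the real signature may differ.
from zhilight import ModelLoader

ModelLoader.convert_pt_to_safetensors("/path/to/checkpoint_dir")
```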
- Major Features
- Support the Mistral model.
- Support loading models from safetensors.
- Support loading a model with multiple threads across multiple files.
- Major Features
- FlashDecoding ops now support batching (enabled in static batch).
- Major Features
- Dynamic batch support for the Dragonfly model.
- Rotary embedding supports the rope_theta config.
- Other Changes
- Loader adapts to the weight-file naming of CPMLive checkpoints.
- Major Features
- Dragonfly model support.
- Other Changes
- Loader interface changes: the is_chatml option moved to config.json.
- Added an explicit model_type option to avoid misbehavior caused by ambiguous parameters.
- Important Bug Fixes
- Fix attention kernel bug introduced in 0.2.92.
- Major Features
- Dynamic batch supports flash attention.
- Major Optimizations
- Optimize attention kernel for bfloat16.
- Deprecated Features
- Dynamic batch disables the seed for beam search.
- Major Features
- Support flash attention.
- Add first-token delay to the dynamic batch API.
- Bug Fixes
- Fix random search when num_results > 1.
- Fix loader with ChatML and support initializing the tokenizer with tokens.
- Major Optimizations
- Optimize attention kernel.
- Turn on the custom kernel for single requests.
- Use the non-MMA kernel version.
- Other Changes
- Change RawEmbedding parallel mode to row.
- Bug Fixes
- Fix random search when num_results > 1.
- Fix config bug.
- Speed Optimizations
- Turn on CPM_FUSE_QKV and CPM_FUSE_FF_IN by default.
- Optimize batch_generator.cpp: remove context copying.
- Speed Optimizations
- Use NCCL by default; remove cudaDeviceSynchronize before/after reduce sum.
- Major Optimizations
- Dynamic batch supports timeout and cancellation.
- Fuse feedforward w_in with w_gated.
- Bug Fixes
- Fix bf16 in speculative sampling.
- Major Features
- Support Int4
- Major Optimizations
- Fuse int8 linear for QKV
- Code Refactoring
- Fuse NormalLinear with ParallelLinear
- Bug Fixes
- Fix random_search generation.