zhilight Changelog

0.4.3~4

  • Major Features
    • INT8 KV cache.
    • Varlen flash decoding.

0.4.2

  • Major Features
    • Fix the MiniCPM loader.
    • FlashDecoding now depends on the standard flash-attn shared library (.so).

0.4.1

  • Major Features
    • Refactor and delete unused code.
    • Migrate the OpenAI-compatible server to zhilight.

0.3.83~84

  • Major Features
    • Set HOST_REDUCE=1 for qwen2 14b.

0.3.81~82

  • Major Features
    • Fix linear.cpp with bias for qwen2 14b.
    • Use TensorRT for Int8Linear.

0.3.80

  • Major Features
    • GPTQ supports the Marlin kernel.
    • Chunked prefill.

0.3.77~79

  • Major Features
    • Dynamic batch supports PRE_ALLOC_ALL_TOKEN (fixes OOM).
    • Reserve more memory for DUAL_STREAM=1 (controlled by NEED_DEQUANT_WEIGHT).
    • Enlarge BLOCK_M for gemm_warp_reduce (faster W4A16 decode on A800).
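
These switches are presumably read from the environment at engine startup, so they must be set before the model is constructed. A minimal sketch; only the variable names come from the entries above, and any semantics beyond the changelog text are assumptions:

```python
import os

# Set before constructing the zhilight model. Variable names are from the
# entries above; value semantics beyond the changelog text are assumptions.
os.environ["PRE_ALLOC_ALL_TOKEN"] = "1"   # pre-allocate the full token budget (avoids mid-run OOM)
os.environ["DUAL_STREAM"] = "1"           # dual-stream encode
os.environ["NEED_DEQUANT_WEIGHT"] = "1"   # reserve extra memory for dequantized weights
# ... construct the model after this point ...
```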

0.3.76

  • Major Features
    • Support CommandR+ model.
  • Bug Fix
    • Fix desc_act for TP.

0.3.68~75

  • Major Features
    • Support DeepSeekV2 model.

0.3.67

  • Bug Fix
    • Fix ChatMLTokenizer adaption.

0.3.63~65

  • Major Features
    • W4A16 supports the TensorRT kernel: GPTQ_KERNEL_ALGO=2.
    • W4A8 INT8 supports the TensorRT kernel: W4_INT8_ALGO=2.
    • Optimize CPM_FUSE_FF_IN=2.

0.3.62

  • Bug Fix
    • Fix stream decode.

0.3.59~61

  • Bug Fix
    • Fix crash when py_task is destroyed.

0.3.48~58

  • Major Features
    • Dual-stream encode (20%+ speedup on 4090pro).
  • Bug Fix
    • Fix flash_attention when real batch > 1.

0.3.47

  • Major Features
    • Fuse gate activation with multiplication in FeedForward (see the sketch below).
  • Bug Fix
    • Fix Gemm bf16 when HIGH_PRECISION=0.
    • Change silu and gelu from double to single precision (double is very slow).
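
To make the fused computation concrete, here is a numpy sketch of the gated FeedForward: the fused kernel computes activation(x @ w_gate) multiplied elementwise with x @ w_in in a single pass. Weight names and shapes here are illustrative, not zhilight's:

```python
import numpy as np

def silu(x):
    # SiLU in single precision (this release moved silu/gelu off double).
    return x * (1.0 / (1.0 + np.exp(-x)))

def feed_forward_gated(x, w_in, w_gate, w_out):
    # What the fusion computes in one pass: gate activation times the
    # un-gated projection, then the output projection.
    return (silu(x @ w_gate) * (x @ w_in)) @ w_out

# Toy shapes: hidden=8, intermediate=16.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8), dtype=np.float32)
w_gate = rng.standard_normal((8, 16), dtype=np.float32)
w_in = rng.standard_normal((8, 16), dtype=np.float32)
w_out = rng.standard_normal((16, 8), dtype=np.float32)
print(feed_forward_gated(x, w_in, w_gate, w_out).shape)  # (2, 8)
```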

0.3.46

  • Bug Fix for 0.3.45
    • Add 89 to CMAKE_CUDA_ARCHITECTURES to build fp8 kernel.

0.3.45

  • Major Features
    • W8A8 FP8.
    • Update ModelContext::reduce_tp_int8.

0.3.44

  • Major Features
    • W4A8 fp8 v1.
    • Optimize W4A8 int8 v1.

0.3.43

  • Major Features
    • W4A8 int8 v1.

0.3.41~42

  • Major Features
    • Group-wise int8 quantized reduce_sum for tensor parallelism (see the sketch below).
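
A rough numpy sketch of group-wise int8 quantization as used around a tensor-parallel reduce: quantize before the all-reduce to cut traffic, dequantize after. The group size of 128 and the rounding scheme are assumptions; the real implementation runs on-GPU:

```python
import numpy as np

GROUP = 128  # assumed group size; the entry does not state it

def quantize_groupwise_int8(x: np.ndarray):
    # One scale per group of GROUP values, int8 payload.
    g = x.reshape(-1, GROUP)
    scale = np.abs(g).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(g / np.maximum(scale, 1e-8)), -127, 127).astype(np.int8)
    return q, scale

def dequantize_groupwise_int8(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

x = np.random.default_rng(1).standard_normal(1024).astype(np.float32)
q, s = quantize_groupwise_int8(x)
print(np.abs(dequantize_groupwise_int8(q, s) - x).max())  # small quant error
```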

0.3.39

  • Bug Fix
    • Fix GPTQ new kernel for dynamic batch.

0.3.38

  • Bug Fix
    • Fix loading CPMBee.

0.3.37

  • Major Features
    • Fuse GPTQ MOE when batch_size=1.
    • Update MOE: change router's out_type to float; use cudaMemcpy.
    • GPTQ kernel supports the HIGH_PRECISION env variable.
    • Optimize gemm_fuse_gate_in: preload matrix A to shared memory; use the lop3 instruction.

0.3.35

  • Bug Fix
    • Remove automatic work memory setting.

0.3.34

  • Major Features
    • Support Qwen models.
    • Optimize GPTQ new kernel with symmetric quant.

0.3.33

  • Major Features
    • Update GPTQ new kernel:
      • support desc_act;
      • support CPM_FUSE_FF_IN=2 and CPM_FUSE_QKV=2: fuse only for small m.

0.3.32

  • Features
    • Dynamic batch returns top_logprobs (see the sketch below).
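
As a standalone illustration of what a top_logprobs field carries per generated step (zhilight's exact request/response shape is not specified in this entry):

```python
import numpy as np

def top_logprobs(logits: np.ndarray, k: int = 5):
    # Stable log-softmax, then the k most probable token ids with their
    # log-probabilities -- the typical per-step top_logprobs payload.
    m = np.max(logits)
    logp = logits - (m + np.log(np.sum(np.exp(logits - m))))
    idx = np.argsort(logp)[::-1][:k]
    return [(int(i), float(logp[i])) for i in idx]

print(top_logprobs(np.array([2.0, 1.0, 0.5, -1.0]), k=2))
```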

0.3.30~31

  • Major Features
    • Add new GPTQ kernel.
  • Other Changes
    • Disable fuse weight for GPTQ model.
    • Add block index into prefix cache key.
  • Bug Fix
    • Fix parsing config when rope_scaling=None.

0.3.29

  • Bug Fix
    • Fix chat model eos_id.

0.3.28

  • Major Features
    • Support minicpm_moe.

0.3.27

  • Major Features
    • Support dynamic NTK-aware context length extension (see the sketch below).
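
For context, the widely used dynamic NTK rule rescales the RoPE base once the sequence outgrows the trained context; a sketch follows (zhilight's exact parameterization may differ):

```python
import numpy as np

def dynamic_ntk_base(base: float, dim: int, seq_len: int,
                     max_pos: int, factor: float = 2.0) -> float:
    # Once seq_len exceeds the trained context max_pos, grow the RoPE base
    # so the rotary frequencies interpolate smoothly instead of aliasing.
    if seq_len <= max_pos:
        return base
    scale = (factor * seq_len / max_pos) - (factor - 1)
    return base * scale ** (dim / (dim - 2))

print(dynamic_ntk_base(10000.0, 128, 8192, 4096))  # base grows past 10000
```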

0.3.26

  • Bug Fix
    • Cancel fix: invalid reference when the Python task object is destructed.

0.3.25

  • Other Changes
    • Loader supports AWQ models.
  • Bug Fix
    • Fix long-context crash up to 128k (no rope scaling yet).

0.3.24

  • Bug Fix
    • Fix Dragonfly residual scaling depth application.

0.3.23

  • Major Features
    • Add AWQ kernel.

0.3.21~22

  • Bug Fix
    • Fix attention for long input.

0.3.20

  • Bug Fix
    • Fix typecast of passed-in hidden_states.

0.3.19

  • Other Changes
    • Improved paged KV cache page reuse and separated physical/logical page lifecycle management.

0.3.18

  • Fix softmax kernel for long input.

0.3.17

  • Major Features
    • Copy-on-Write for paged attention: identical outputs of parallel sampling are stored in the same physical blocks.

0.3.16

  • Major Features
    • Support Paged Attention.
    • Support parallel sampling (num_results > 1) with a shared-prefix KV cache (see the sketch below).
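
A toy sketch of the block bookkeeping behind this entry and the Copy-on-Write entry in 0.3.17 above: logical blocks per sequence map to refcounted physical blocks, parallel samples share the prefix blocks, and a block is copied only when a writer diverges. All names here are illustrative, not zhilight's code:

```python
class BlockPool:
    """Refcounted physical KV blocks (toy version)."""
    def __init__(self, n_blocks: int):
        self.free = list(range(n_blocks))
        self.refcount = [0] * n_blocks

    def alloc(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1
        return block

    def copy_on_write(self, block: int) -> int:
        # Writing into a shared block: keep it if we are the sole owner,
        # otherwise drop our reference and take a private copy.
        if self.refcount[block] == 1:
            return block
        self.refcount[block] -= 1
        return self.alloc()  # caller then copies the KV entries over

pool = BlockPool(8)
prefix = [pool.alloc(), pool.alloc()]          # prompt KV lives in 2 blocks
sample_a = [pool.share(b) for b in prefix]     # two parallel samples point
sample_b = [pool.share(b) for b in prefix]     # at the same physical blocks
sample_a[-1] = pool.copy_on_write(sample_a[-1])  # a diverges: private copy
```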

0.3.14

  • Major Features
    • Support AWQ model with exllama kernel.
    • Support presence_penalty.
  • Other Changes
    • exllama kernel uses a float accumulator.
  • Bug Fix
    • Dynamic batch with flash attention: set len_k to full_input_len during encode.
    • Fix prompt cache with flash attention.

0.3.13

  • Fix stream decode for HF tokenizer.

0.3.12

  • Major Features
    • Support prompt cache.
    • Flash decoding layer supports paged attention.
  • Config Changes
    • GeneratorArg: beam_size defaults to 1; random-search seed defaults to 42 (see the sketch below).
    • Add QuantType GPTQ.
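
Illustrative only: GeneratorArg, beam_size, and the random-search seed are named in this entry, but the import path and constructor signature below are assumptions:

```python
from zhilight import GeneratorArg  # import path is an assumption

# Spell out the new defaults explicitly; field names come from this entry,
# the constructor signature is assumed.
arg = GeneratorArg(beam_size=1, seed=42)
```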

0.3.11

  • Speed Optimizations
    • Use FlashAttention to copy buffers; first-token latency drops by roughly 10%.

0.3.10

  • Speed Optimizations
    • Split-KV attention during decoding (see the sketch below).
  • New Features
    • Dynamic batch supports chat-format input.
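
The merge step behind split-KV decoding, as a numpy sketch: each chunk of the KV cache is attended independently (in parallel on the GPU), and the partial results are combined with the standard log-sum-exp trick. Shapes and names are illustrative:

```python
import numpy as np

def attend_chunk(q, k, v):
    # Partial attention over one KV chunk: (running max, sum-exp, weighted V).
    s = k @ q
    m = s.max()
    p = np.exp(s - m)
    return m, p.sum(), p @ v

def split_kv_attention(q, k, v, n_chunks=4):
    # Process the KV cache in chunks, then merge partials exactly.
    parts = [attend_chunk(q, kc, vc)
             for kc, vc in zip(np.array_split(k, n_chunks),
                               np.array_split(v, n_chunks))]
    m = max(mi for mi, _, _ in parts)
    z = sum(zi * np.exp(mi - m) for mi, zi, _ in parts)
    o = sum(oi * np.exp(mi - m) for mi, _, oi in parts)
    return o / z

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((1024, 64))
v = rng.standard_normal((1024, 32))
s = k @ q
ref = np.exp(s - s.max()); ref /= ref.sum()
assert np.allclose(split_kv_attention(q, k, v), ref @ v)  # matches full softmax
```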

0.3.08~09

  • Major Features
    • Support GPTQ int4 model.
    • Support CUDA12.

0.3.07

  • Major Features
    • Optimize MOE; MOE supports INT8.
    • Optimize loading distributed parameters.

0.3.06

  • Bug fix for chatml; use CHATML_COMPATIBLE=0 to disable prepending <s>.

0.3.05

  • Other Changes
    • FlashDecoding now compiles against FlashAttention 2.5 (with Paged Attention support).
    • Reduce libflash_decoding.a size to 255 MB.

0.3.04

  • Major Features
    • Support MOE model: mixtral and cpmd_moe.
    • Add ModelLoader.convert_pt_to_safetensors() util method (see the sketch below).
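
Usage might look like the following; only the method name comes from this entry, while the import path and argument are assumptions:

```python
from zhilight import ModelLoader  # import path is an assumption

# One-time conversion of .pt checkpoint file(s) to safetensors; the
# argument (a model directory) is a guess at the intended call shape.
ModelLoader.convert_pt_to_safetensors("/path/to/model_dir")
```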

0.3.03

  • Major Features
    • Support mistral model.
    • Support loading models from safetensors.
    • Support loading model files with multiple threads.

0.3.02

  • Major Features
    • FlashDecoding ops now support batching (enabled in static batch).

0.3.01

  • Major Features
    • Dynamic batch support for the dragonfly model.
    • Rotary embedding supports the rope_theta config (see the sketch below).
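
For context, rope_theta is the base of the standard rotary-embedding frequency schedule; a small sketch of what the config value controls:

```python
import numpy as np

def rotary_inv_freq(dim: int, rope_theta: float = 10000.0):
    # Standard RoPE frequencies; rope_theta is the base this entry makes
    # configurable (e.g. 1e6 for long-context model variants).
    return rope_theta ** (-np.arange(0, dim, 2, dtype=np.float64) / dim)

print(rotary_inv_freq(8, rope_theta=10000.0))
print(rotary_inv_freq(8, rope_theta=1000000.0))  # slower-rotating bands
```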

0.2.99

  • Other Changes
    • Loader adapts to CPMLive checkpoint weight-file naming.

0.2.98

  • Major Features
    • Dragonfly model support.
  • Other Changes
    • Loader interface changes: the is_chatml option moved to config.json.
    • Added an explicit model_type option to avoid misbehavior caused by ambiguous parameters.

0.2.97

  • Important Bug Fixes
    • Fix attention kernel bug introduced in 0.2.92.

0.2.96

  • Major Features
    • Dynamic batch supports flash attention.
  • Major Optimizations
    • Optimize attention kernel for bfloat16.
  • Deprecated Features
    • Dynamic batch disables seed for beam search.

0.2.95

  • Major Features
    • Support flash attention.

0.2.94

  • Add first-token delay to the dynamic batch API.

0.2.93

  • Bug Fixes
    • Fix random search when num_results > 1.
    • Fix loader with chatml; support initializing the tokenizer from tokens.

0.2.92

  • Major Optimizations
    • Optimize attention kernel.
      • Turn on the custom kernel for single requests.
      • Use the non-MMA kernel version.
  • Other changes
    • Change RawEmbedding parallel mode to row.
  • Bug Fixes
    • Fix random search when num_results > 1.
    • Fix config bug.

0.2.91

  • Speed Optimizations
    • Turn on CPM_FUSE_QKV and CPM_FUSE_FF_IN by default.
    • Optimize batch_generator.cpp: remove context copying.

0.2.90

  • Speed Optimizations
    • Use NCCL by default; remove cudaDeviceSynchronize before/after reduce sum.

0.2.89

  • Major Optimizations
    • Dynamic batch supports timeout and cancellation.
    • Fuse feedforward w_in with w_gated.

0.2.88

  • Bug Fixes
    • Fix bf16 in speculative sampling.

0.2.87

  • Major Features
    • Support Int4.

0.2.86

  • Major Optimizations
    • Fuse int8 linear for QKV.
  • Code Refactoring
    • Fuse NormalLinear with ParallelLinear.
  • Bug Fixes
    • Fix random_search generation.