Feature Request: Performance enhancement in ARM CPU (Kunpeng 920) #10754

Open
4 tasks done
feikiss opened this issue Dec 10, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

feikiss commented Dec 10, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

llama.cpp runs very slowly on an ARM CPU (Kunpeng 920).

I pulled the Docker image for ARM and ran an instance pinned to 4 cores on the same NUMA node.
The command I used:

docker run --rm --net=host --cpuset-cpus="32-35" --cpuset-mems="1" -v /root/share/:/root/share/ ghcr.io/ggerganov/llama.cpp:server-b4226 -m /root/share/qwen2.5-0.5b-q8/qwen2.5-0.5b-instruct-q8_0.gguf
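On a multi-node machine like this, llama.cpp's thread and NUMA options are worth trying alongside the container pinning. A hedged variant of the command above (same image tag and model path; the flag values are illustrative, not a confirmed fix):

```shell
# Pin the container to 4 cores on NUMA node 1 (as above) and additionally
# tell llama.cpp to use exactly 4 threads and a NUMA-aware placement strategy.
docker run --rm --net=host \
  --cpuset-cpus="32-35" --cpuset-mems="1" \
  -v /root/share/:/root/share/ \
  ghcr.io/ggerganov/llama.cpp:server-b4226 \
  -m /root/share/qwen2.5-0.5b-q8/qwen2.5-0.5b-instruct-q8_0.gguf \
  --threads 4 --numa numactl
```

`--threads` caps the compute threads at the number of pinned cores, and `--numa` selects one of llama.cpp's NUMA strategies (`distribute`, `isolate`, `numactl`); whether they help on Kunpeng 920 would need to be measured.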

CPU info

Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    aarch64
CPU op-mode(s):                  64-bit
Byte Order:                      Little Endian
CPU(s):                          256
On-line CPU(s) list:             0-255
Vendor ID:                       HiSilicon
Model name:                      Kunpeng-920
Model:                           0
Thread(s) per core:              1
Core(s) per cluster:             64
Socket(s):                       -
Cluster(s):                      4
Stepping:                        0x1
Frequency boost:                 disabled
CPU max MHz:                     2600.0000
CPU min MHz:                     200.0000
BogoMIPS:                        200.00
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache:                       16 MiB (256 instances)
L1i cache:                       16 MiB (256 instances)
L2 cache:                        128 MiB (256 instances)
L3 cache:                        256 MiB (8 instances)
NUMA node(s):                    8
NUMA node0 CPU(s):               0-31
NUMA node1 CPU(s):               32-63
NUMA node2 CPU(s):               64-95
NUMA node3 CPU(s):               96-127
NUMA node4 CPU(s):               128-159
NUMA node5 CPU(s):               160-191
NUMA node6 CPU(s):               192-223
NUMA node7 CPU(s):               224-255
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.46.3
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post2.dev152+g1f6584ee.d20241127
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

VLLM_CPU_KVCACHE_SPACE=1
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/cv2/../../lib64:

Result:

slot print_timing: id  0 | task 60 |
prompt eval time =    3636.68 ms /   126 tokens (   28.86 ms per token,    34.65 tokens per second)
       eval time =    4827.58 ms /    10 tokens (  482.76 ms per token,     2.07 tokens per second)
      total time =    8464.26 ms /   136 tokens
slot launch_slot_: id  0 | task 71 | processing task
slot update_slots: id  0 | task 71 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 123
slot update_slots: id  0 | task 71 | kv cache rm [0, end)
slot update_slots: id  0 | task 71 | prompt processing progress, n_past = 123, n_tokens = 123, progress = 1.000000
slot update_slots: id  0 | task 71 | prompt done, n_past = 123, n_tokens = 123
slot      release: id  0 | task 71 | stop processing: n_past = 132, truncated = 0
slot print_timing: id  0 | task 71 |
prompt eval time =    3569.76 ms /   123 tokens (   29.02 ms per token,    34.46 tokens per second)
       eval time =    4830.39 ms /    10 tokens (  483.04 ms per token,     2.07 tokens per second)
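The throughput figures in these logs follow directly from the reported times; a small sketch reproducing the numbers for task 60 above:

```python
# Throughput derived from the server log for task 60:
# prompt eval: 3636.68 ms for 126 tokens; generation: 4827.58 ms for 10 tokens.
prompt_ms, prompt_tokens = 3636.68, 126
eval_ms, eval_tokens = 4827.58, 10

prompt_tps = prompt_tokens / (prompt_ms / 1000)  # prompt-processing throughput
eval_tps = eval_tokens / (eval_ms / 1000)        # token-generation throughput

print(f"prompt: {prompt_tps:.2f} tokens/s")  # ~34.65
print(f"eval:   {eval_tps:.2f} tokens/s")    # ~2.07
```

The gap between the two rates is expected (prompt processing is batched, generation is sequential), but 2.07 tokens/s for a 0.5B q8_0 model on 4 cores is the slowdown being reported.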

CPU usage:

CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O       BLOCK I/O         PIDS
6e0062c2574d   nice_banzai   255.28%   158MiB / 61.34GiB   0.25%     0B / 0B       0B / 0B           513

Motivation

I hope llama.cpp can run with higher performance on this CPU, e.g. around 10 tokens/s. :)

Possible Implementation

No response

@feikiss feikiss added the enhancement New feature or request label Dec 10, 2024
@feikiss feikiss changed the title Feature Request: Performance enhancement in ARM CPU (KunPeng 920) Feature Request: Performance enhancement in ARM CPU (Kunpeng 920) Dec 10, 2024