Feature Request: Performance enhancement in ARM CPU (Kunpeng 920) #10754

Open
4 tasks done
feikiss opened this issue Dec 10, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

feikiss commented Dec 10, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

llama.cpp runs very slowly on an ARM CPU (Kunpeng 920).

I pulled the Docker image for ARM and ran an instance pinned to 4 cores on the same NUMA node.
The command I used:

docker run --rm --net=host --cpuset-cpus="32-35" --cpuset-mems="1" -v /root/share/:/root/share/ ghcr.io/ggerganov/llama.cpp:server-b4226 -m /root/share/qwen2.5-0.5b-q8/qwen2.5-0.5b-instruct-q8_0.gguf
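On a multi-node machine like this, llama.cpp's thread and NUMA options are worth trying alongside the container pinning. A hedged variant of the command above (same image tag and model path; the flag values are illustrative, not a confirmed fix):

```shell
# Pin the container to 4 cores on NUMA node 1 (as above) and additionally
# tell llama.cpp to use exactly 4 threads and a NUMA-aware placement strategy.
docker run --rm --net=host \
  --cpuset-cpus="32-35" --cpuset-mems="1" \
  -v /root/share/:/root/share/ \
  ghcr.io/ggerganov/llama.cpp:server-b4226 \
  -m /root/share/qwen2.5-0.5b-q8/qwen2.5-0.5b-instruct-q8_0.gguf \
  --threads 4 --numa numactl
```

`--threads` caps the compute threads at the number of pinned cores, and `--numa` selects one of llama.cpp's NUMA strategies (`distribute`, `isolate`, `numactl`); whether they help on Kunpeng 920 would need to be measured.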

CPU info

Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    aarch64
CPU op-mode(s):                  64-bit
Byte Order:                      Little Endian
CPU(s):                          256
On-line CPU(s) list:             0-255
Vendor ID:                       HiSilicon
Model name:                      Kunpeng-920
Model:                           0
Thread(s) per core:              1
Core(s) per cluster:             64
Socket(s):                       -
Cluster(s):                      4
Stepping:                        0x1
Frequency boost:                 disabled
CPU max MHz:                     2600.0000
CPU min MHz:                     200.0000
BogoMIPS:                        200.00
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache:                       16 MiB (256 instances)
L1i cache:                       16 MiB (256 instances)
L2 cache:                        128 MiB (256 instances)
L3 cache:                        256 MiB (8 instances)
NUMA node(s):                    8
NUMA node0 CPU(s):               0-31
NUMA node1 CPU(s):               32-63
NUMA node2 CPU(s):               64-95
NUMA node3 CPU(s):               96-127
NUMA node4 CPU(s):               128-159
NUMA node5 CPU(s):               160-191
NUMA node6 CPU(s):               192-223
NUMA node7 CPU(s):               224-255
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.46.3
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post2.dev152+g1f6584ee.d20241127
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

VLLM_CPU_KVCACHE_SPACE=1
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/cv2/../../lib64:

Result:

slot print_timing: id  0 | task 60 |
prompt eval time =    3636.68 ms /   126 tokens (   28.86 ms per token,    34.65 tokens per second)
       eval time =    4827.58 ms /    10 tokens (  482.76 ms per token,     2.07 tokens per second)
      total time =    8464.26 ms /   136 tokens
slot launch_slot_: id  0 | task 71 | processing task
slot update_slots: id  0 | task 71 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 123
slot update_slots: id  0 | task 71 | kv cache rm [0, end)
slot update_slots: id  0 | task 71 | prompt processing progress, n_past = 123, n_tokens = 123, progress = 1.000000
slot update_slots: id  0 | task 71 | prompt done, n_past = 123, n_tokens = 123
slot      release: id  0 | task 71 | stop processing: n_past = 132, truncated = 0
slot print_timing: id  0 | task 71 |
prompt eval time =    3569.76 ms /   123 tokens (   29.02 ms per token,    34.46 tokens per second)
       eval time =    4830.39 ms /    10 tokens (  483.04 ms per token,     2.07 tokens per second)
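The throughput figures in these logs follow directly from the reported times; a small sketch reproducing the numbers for task 60 above:

```python
# Throughput derived from the server log for task 60:
# prompt eval: 3636.68 ms for 126 tokens; generation: 4827.58 ms for 10 tokens.
prompt_ms, prompt_tokens = 3636.68, 126
eval_ms, eval_tokens = 4827.58, 10

prompt_tps = prompt_tokens / (prompt_ms / 1000)  # prompt-processing throughput
eval_tps = eval_tokens / (eval_ms / 1000)        # token-generation throughput

print(f"prompt: {prompt_tps:.2f} tokens/s")  # ~34.65
print(f"eval:   {eval_tps:.2f} tokens/s")    # ~2.07
```

The gap between the two rates is expected (prompt processing is batched, generation is sequential), but 2.07 tokens/s for a 0.5B q8_0 model on 4 cores is the slowdown being reported.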

CPU usage:

CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O       BLOCK I/O         PIDS
6e0062c2574d   nice_banzai   255.28%   158MiB / 61.34GiB   0.25%     0B / 0B       0B / 0B           513

Motivation

I hope llama.cpp can run with higher performance on this CPU, e.g. around 10 tokens/s. :)

Possible Implementation

No response

@feikiss feikiss added the enhancement New feature or request label Dec 10, 2024
@feikiss feikiss changed the title Feature Request: Performance enhancement in ARM CPU (KunPeng 920) Feature Request: Performance enhancement in ARM CPU (Kunpeng 920) Dec 10, 2024