
Question in testing latency #3

Open
alkane7 opened this issue Dec 14, 2024 · 2 comments


alkane7 commented Dec 14, 2024

Hi! Thank you for your great work!
We are trying to test the latency of MagicPIG. We also adjusted the hyperparameters to K=10 and L=100 in search of better latency.
However, TTFT and TPOT seem high on InfiniteBench.
Our partial results are as follows:

            code_debug   code_run
TTFT (s)      630.37      140
TPOT (s)        1.246       0.84
# Greedy decoding loop used to measure TPOT
for k in range(GEN_LEN):
    st = time.time()
    input_ids = logits.argmax(dim=-1)  # pick the next token greedily
    logits = llm.inference(input_ids=input_ids,
                           position_ids=position_ids[:, PREFIX_LEN + k:PREFIX_LEN + k + 1])
    output.append(input_ids.item())
    en = time.time()
    total_decode_time.append(en - st)
    if input_ids.item() in config["eos"]:  # stop at end-of-sequence
        break
TPOT = sum(total_decode_time) / len(total_decode_time)

We would like to know the reason for the high latency and whether there is an error in our implementation.
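For reference, here is a minimal sketch of how TTFT and TPOT can be measured separately. `prefill` and `decode_step` are hypothetical stand-ins for the real model calls, not MagicPIG APIs:

```python
import time

def measure_latency(prefill, decode_step, gen_len, eos_ids):
    """Hypothetical timing harness: `prefill` runs the prompt once and
    returns the first token; `decode_step` generates one token per call."""
    st = time.time()
    token = prefill()                        # prompt processing
    ttft = time.time() - st                  # time to first token
    decode_times, output = [], [token]
    for _ in range(gen_len - 1):
        st = time.time()
        token = decode_step(token)           # one autoregressive step
        decode_times.append(time.time() - st)
        output.append(token)
        if token in eos_ids:                 # stop at end-of-sequence
            break
    tpot = sum(decode_times) / max(len(decode_times), 1)  # avg per-token time
    return ttft, tpot, output
```

This mirrors the loop above but keeps prefill time (TTFT) out of the per-token average (TPOT).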

Contributor

dreaming-panda commented Dec 15, 2024

Usually, there are several reasons for high latency:

  1. Other programs are running on the CPUs. Currently, we bind the attention computation to 64 CPU cores; if any of them are occupied or shared by other processes, latency will be high.
  2. Check the number of physical cores on your cluster. We use 64 cores by default; if you have fewer than 64 physical cores (not hyper-threads), you need to manually decrease the OpenMP thread count in lsh.cc and sparse_attention.cc.
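The core-count check in point 2 can be sketched as follows. This is a heuristic only: `os.cpu_count()` reports logical CPUs, so halving it is a rough guess at physical cores on hyper-threaded machines, and `OMP_NUM_THREADS` must be set before the OpenMP runtime initializes:

```python
import os

logical = os.cpu_count() or 1          # logical CPUs, hyper-threads included
physical_guess = max(1, logical // 2)  # crude physical-core estimate

# Cap at the 64 threads MagicPIG uses by default.
os.environ["OMP_NUM_THREADS"] = str(min(64, physical_guess))
print(os.environ["OMP_NUM_THREADS"])
```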

The performance of MagicPIG depends heavily on the state of the CPUs.

Besides, can you run the models/bench.sh file? We added it in the v0.2 branch.

@dreaming-panda
Contributor

In some cases, when you cannot use numactl, adding OMP_NUM_THREADS=64 can help.
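For example (a sketch only; the NUMA node number and script path are assumptions to adjust for your machine):

```shell
# Without numactl: set the OpenMP thread count explicitly before launching.
export OMP_NUM_THREADS=64
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# With numactl, one would instead bind to a NUMA node, e.g.:
#   numactl -N 0 -m 0 bash models/bench.sh
```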
