
Attention not working properly in FlashRobertaModel and FlashBertModel #694

Open
2 of 4 tasks
sgiorgis opened this issue Nov 22, 2024 · 0 comments
System Info

Operating System

Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04

Hardware used

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1B.0 Off |                    0 |
|  0%   23C    P8             36W /  300W |       1MiB /  23028MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    Off |   00000000:00:1C.0 Off |                    0 |
|  0%   18C    P8             15W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    Off |   00000000:00:1D.0 Off |                    0 |
|  0%   18C    P8             16W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   18C    P8             16W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Set the model to BAAI/bge-base-en-v1.5
  2. Set the volume to volume=$PWD/data
  3. Run lorax with docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:bd92e52 --model-id $model --max-input-length=512
  4. Run one example curl localhost:8080/embed -X POST -d '{"inputs": "Represent this sentence for searching relevant passages: who has the most instagram followers on instagram"}' -H 'Content-Type: application/json'
  5. Run second example: curl localhost:8080/embed -X POST -d '{"inputs": "Represent this sentence for searching relevant passages: how many episodes in a season of stranger things"}' -H 'Content-Type: application/json'
  6. Run the same queries with Hugging Face transformers directly, following the instructions from https://huggingface.co/BAAI/bge-base-en-v1.5

The same applies to: https://huggingface.co/BAAI/bge-reranker-v2-m3
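A quick way to confirm the symptom is to compare the two returned embeddings with cosine similarity: a value of ~1.0 for two unrelated queries means the server is producing identical outputs. This is an illustrative sketch (the placeholder vectors stand in for the `embeddings` fields returned by the two /embed calls above, which would normally be parsed from the JSON responses):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder vectors standing in for the two /embed responses;
# in the buggy case the real vectors come back identical.
emb1 = [0.1, 0.2, 0.3]
emb2 = [0.1, 0.2, 0.3]
print(cosine_similarity(emb1, emb2))  # ~1.0 indicates identical outputs
```

For unrelated queries like the two above, a correctly working encoder should give a similarity well below 1.0.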

Expected behavior

The output embeddings of the two queries are exactly the same as each other, and both are very different from the embeddings I get when using the same model from Hugging Face directly. The same applies to BAAI/bge-reranker-v2-m3, which is a RobertaModel, so both the BERT and RoBERTa models seem to have the same issue.

I did line-by-line debugging of your implementation, running the server locally and comparing the output of each layer against the official Hugging Face implementation. The output of the attention in each layer is completely different from the attention computed by Hugging Face, so I suspect the issue is here:

attn_output = attention(q, k, v, None, None, cu_seqlens, max_s, self.softmax_scale, causal=False)

and here:

attn_output = attention(q, k, v, None, None, cu_seqlens, max_s, self.softmax_scale, causal=False)
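To diff the flash-attention output against a known-good baseline, one option is a plain NumPy implementation of non-causal scaled dot-product attention over a single unpadded sequence. This is a hedged reference sketch, not the lorax code: the `reference_attention` helper, the shapes, and the random inputs are illustrative, but the math is the standard attention that both implementations should agree on (up to floating-point tolerance):

```python
import numpy as np

def reference_attention(q, k, v, softmax_scale):
    """Plain non-causal scaled dot-product attention over one
    unpadded sequence, as a reference for the flash-attention call."""
    scores = softmax_scale * (q @ k.transpose(0, 2, 1))  # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ v                                    # (heads, seq, head_dim)

# Illustrative shapes for a BERT-base-like layer.
heads, seq, dim = 12, 8, 64
rng = np.random.default_rng(0)
q = rng.standard_normal((heads, seq, dim)).astype(np.float32)
k = rng.standard_normal((heads, seq, dim)).astype(np.float32)
v = rng.standard_normal((heads, seq, dim)).astype(np.float32)
out = reference_attention(q, k, v, softmax_scale=1.0 / np.sqrt(dim))
print(out.shape)  # (12, 8, 64)
```

Feeding the same per-layer q, k, v tensors (reshaped to head-major layout) through this reference and through the `attention(...)` call above should make it clear at which point the outputs diverge.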
