
Attention not working properly in FlashRobertaModel and FlashBertModel #694

Open
2 of 4 tasks
sgiorgis opened this issue Nov 22, 2024 · 0 comments
System Info

Operating System

Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04

Hardware used

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1B.0 Off |                    0 |
|  0%   23C    P8             36W /  300W |       1MiB /  23028MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    Off |   00000000:00:1C.0 Off |                    0 |
|  0%   18C    P8             15W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    Off |   00000000:00:1D.0 Off |                    0 |
|  0%   18C    P8             16W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   18C    P8             16W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Set the model to BAAI/bge-base-en-v1.5
  2. Set the volume to volume=$PWD/data
  3. Run lorax with docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:bd92e52 --model-id $model --max-input-length=512
  4. Run one example curl localhost:8080/embed -X POST -d '{"inputs": "Represent this sentence for searching relevant passages: who has the most instagram followers on instagram"}' -H 'Content-Type: application/json'
  5. Run second example: curl localhost:8080/embed -X POST -d '{"inputs": "Represent this sentence for searching relevant passages: how many episodes in a season of stranger things"}' -H 'Content-Type: application/json'
  6. Run the same queries with Hugging Face transformers directly, following the instructions from https://huggingface.co/BAAI/bge-base-en-v1.5

The same applies to: https://huggingface.co/BAAI/bge-reranker-v2-m3
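A quick way to confirm the symptom is to compare the two returned embeddings with cosine similarity: a value of ~1.0 for two unrelated queries means the server is producing identical outputs. This is an illustrative sketch (the placeholder vectors stand in for the `embeddings` fields returned by the two /embed calls above, which would normally be parsed from the JSON responses):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder vectors standing in for the two /embed responses;
# in the buggy case the real vectors come back identical.
emb1 = [0.1, 0.2, 0.3]
emb2 = [0.1, 0.2, 0.3]
print(cosine_similarity(emb1, emb2))  # ~1.0 indicates identical outputs
```

For unrelated queries like the two above, a correctly working encoder should give a similarity well below 1.0.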

Expected behavior

The output embeddings of the two queries are exactly the same as each other, and both are very different from the embeddings I get when using the same model from Hugging Face directly. The same applies to BAAI/bge-reranker-v2-m3, which is a RobertaModel, so both the BERT and RoBERTa models seem to have the same issue.

I did line-by-line debugging of your implementation, running the server locally and comparing the output of each layer against the official Hugging Face implementation. The output of the attention in each layer is completely different from the attention computed by Hugging Face, so I suspect the issue is here:

attn_output = attention(q, k, v, None, None, cu_seqlens, max_s, self.softmax_scale, causal=False)

and here:

attn_output = attention(q, k, v, None, None, cu_seqlens, max_s, self.softmax_scale, causal=False)
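To diff the flash-attention output against a known-good baseline, one option is a plain NumPy implementation of non-causal scaled dot-product attention over a single unpadded sequence. This is a hedged reference sketch, not the lorax code: the `reference_attention` helper, the shapes, and the random inputs are illustrative, but the math is the standard attention that both implementations should agree on (up to floating-point tolerance):

```python
import numpy as np

def reference_attention(q, k, v, softmax_scale):
    """Plain non-causal scaled dot-product attention over one
    unpadded sequence, as a reference for the flash-attention call."""
    scores = softmax_scale * (q @ k.transpose(0, 2, 1))  # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ v                                    # (heads, seq, head_dim)

# Illustrative shapes for a BERT-base-like layer.
heads, seq, dim = 12, 8, 64
rng = np.random.default_rng(0)
q = rng.standard_normal((heads, seq, dim)).astype(np.float32)
k = rng.standard_normal((heads, seq, dim)).astype(np.float32)
v = rng.standard_normal((heads, seq, dim)).astype(np.float32)
out = reference_attention(q, k, v, softmax_scale=1.0 / np.sqrt(dim))
print(out.shape)  # (12, 8, 64)
```

Feeding the same per-layer q, k, v tensors (reshaped to head-major layout) through this reference and through the `attention(...)` call above should make it clear at which point the outputs diverge.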
