System Info
Operating System
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Hardware used
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1B.0 Off | 0 |
| 0% 23C P8 36W / 300W | 1MiB / 23028MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A10G Off | 00000000:00:1C.0 Off | 0 |
| 0% 18C P8 15W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A10G Off | 00000000:00:1D.0 Off | 0 |
| 0% 18C P8 16W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 18C P8 16W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Information
Docker
The CLI directly
Tasks
An officially supported command
My own modifications
Reproduction
Set the model: model=BAAI/bge-base-en-v1.5
Set the volume to volume=$PWD/data
Run lorax with docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:bd92e52 --model-id $model --max-input-length=512
Run the first example: curl localhost:8080/embed -X POST -d '{"inputs": "Represent this sentence for searching relevant passages: who has the most instagram followers on instagram"}' -H 'Content-Type: application/json'
Run the second example: curl localhost:8080/embed -X POST -d '{"inputs": "Represent this sentence for searching relevant passages: how many episodes in a season of stranger things"}' -H 'Content-Type: application/json'
The same thing applies to https://huggingface.co/BAAI/bge-reranker-v2-m3.
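For reference, a minimal sketch of the numeric check, assuming the container started above is listening on localhost:8080; the exact JSON shape of the /embed response is an assumption and may need adjusting:

```python
# Post both queries to the running lorax container and compare the returned vectors.
import requests
import numpy as np

PREFIX = "Represent this sentence for searching relevant passages: "
queries = [
    PREFIX + "who has the most instagram followers on instagram",
    PREFIX + "how many episodes in a season of stranger things",
]

def embed(text):
    r = requests.post("http://localhost:8080/embed", json={"inputs": text})
    r.raise_for_status()
    body = r.json()
    # Assumption: the response is either a list of floats or a dict with an "embeddings" field.
    vec = body["embeddings"] if isinstance(body, dict) and "embeddings" in body else body
    return np.asarray(vec, dtype=np.float32).ravel()

a, b = embed(queries[0]), embed(queries[1])
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print("cosine similarity between the two queries:", cos)
print("identical vectors:", np.allclose(a, b))
```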
Expected behavior
The output embeddings of the two queries are exactly the same, and both are very different from the embedding I get when using the same model from Hugging Face directly. The same applies to BAAI/bge-reranker-v2-m3, which is a RoBERTa model, so BERT and RoBERTa models seem to have the same issue.
I did line-by-line debugging of your implementation, running the server locally and comparing the outputs of each layer with the official Hugging Face implementation. The attention output in each layer is completely different from the attention computed by the Hugging Face code, so I guess the issue is here:
lorax/server/lorax_server/models/custom_modeling/flash_bert_modeling.py, line 165 (commit c0e5798)
and here:
lorax/server/lorax_server/models/custom_modeling/flash_roberta_modeling.py, line 98 (commit c0e5798)
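For reference, a minimal sketch of how the Hugging Face baseline can be computed with plain transformers, assuming the CLS-pooling plus L2-normalization usage described on the bge model card; output_hidden_states=True exposes the per-layer outputs used for the layer-by-layer comparison:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "BAAI/bge-base-en-v1.5"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

text = ("Represent this sentence for searching relevant passages: "
        "who has the most instagram followers on instagram")
inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Sentence embedding: normalized [CLS] token of the last layer (bge model card usage).
emb = torch.nn.functional.normalize(out.last_hidden_state[:, 0], p=2, dim=-1)
print(emb.shape, emb[0, :8])

# hidden_states[0] is the embedding layer output, hidden_states[i] the output of
# encoder layer i; these are the per-layer values to compare against the lorax layers.
for i, h in enumerate(out.hidden_states):
    print(f"layer {i}: {tuple(h.shape)}")
```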