[DLRM v2] Using the model for the inference reference implementation #648
Hi Pablo,
Already have this version, but the error persists.
Have you tried to remove fbgemm-gpu as well?
@yuankuns When I try to remove the
I managed to run the CPU version with
@pgmpablo157321 It's interesting, since there is no GPU on our server, and only fbgemm-gpu-cpu==0.3.2 works for our case.
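For anyone else debugging this, here is a minimal sanity check of which fbgemm build Python actually imports; as far as I know, both the GPU wheel and the fbgemm-gpu-cpu wheel install the same `fbgemm_gpu` module, so only one of them should be present at a time:

```python
# Sanity check: confirm which fbgemm wheel is being imported on a CPU-only box.
import torch
import fbgemm_gpu  # both fbgemm-gpu and fbgemm-gpu-cpu install this module

print(fbgemm_gpu.__file__)        # path shows which installed wheel is in use
print(torch.cuda.is_available())  # expected: False on a server without GPUs
```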
@pgmpablo157321 is this still an issue?
I am currently making the reference implementation and am stuck deploying the model on multiple GPUs.
Here is a link to the PR: mlcommons/inference#1373
Here is a link to the file where the model is: https://github.com/mlcommons/inference/blob/7c64689b261f97a4fc3410bff584ac2439453bcc/recommendation/dlrm_v2/pytorch/python/backend_pytorch_native.py
Currently this works for a debugging model and a single GPU, but fails when I try to run it with multiple ones. Here are the issues that I have:
or
This could be because I am trying to load a sharded model on a different number of ranks. Do you know if that could be related?
I have tried PyTorch versions 1.12, 1.13, 2.0.0, and 2.0.1, and fbgemm versions 0.3.2 and 0.4.1.
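For context, here is a minimal sketch (not the reference implementation itself) of the kind of multi-rank load involved, assuming the model is a torchrec module sharded with DistributedModelParallel and checkpointed with torchsnapshot, launched with one process per GPU via torchrun; `build_model` and `snapshot_path` are placeholders:

```python
# Minimal sketch: restoring a sharded torchrec model on several ranks.
# Assumes torchrun launches one process per GPU and sets the env variables below.
import os
import torch
import torch.distributed as dist
import torchsnapshot
from torchrec.distributed.model_parallel import DistributedModelParallel


def load_sharded_model(build_model, snapshot_path):
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    # The sharding plan chosen here depends on world_size, so a checkpoint
    # written with N ranks may not map cleanly onto M ranks.
    model = DistributedModelParallel(module=build_model(), device=device)

    # torchsnapshot restores each rank's shards in place.
    snapshot = torchsnapshot.Snapshot(path=snapshot_path)
    snapshot.restore(app_state={"model": model})
    return model
```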