
Profiling a multiline statement with Pytorch/CUDA usage results in very low time allocation #223

Open
fnobis opened this issue Aug 4, 2023 · 4 comments

fnobis commented Aug 4, 2023

I run a multiline statement that calls into a PyTorch network, which completes in about 50 ms. When measuring this with line_profiler, the reported time is strangely low. When I put the call on a single line, the measurement looks correct.
In the multiline case, the number of hits is reported as 2 even though the model and all other lines in the code are only run once.
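The 2-hit count is consistent with how a tracing profiler attributes events: each "line" trace event on a source line counts as a hit, and, depending on the Python version, a statement spanning several lines can generate events on more than one of them. A minimal stdlib sketch of that mechanism, using sys.settrace directly rather than line_profiler itself (the function names are illustrative, not from the report):

```python
import sys

hits = {}

def trace_lines(frame, event, arg):
    # Count "line" trace events per source line, the way a
    # line-by-line profiler attributes hits.
    if event == "line":
        hits[frame.f_lineno] = hits.get(frame.f_lineno, 0) + 1
    return trace_lines

def work(a, b):
    return a + b

def multiline_call():
    # A call expression spread over several source lines, mimicking
    # the self.model(...) call from the report.
    result = work(
        1, 2
    )
    return result

sys.settrace(trace_lines)
multiline_call()
sys.settrace(None)
print(hits)  # hit counts keyed by line number; exact split varies by Python version
```

Whether the argument line of the call receives its own event varies across CPython versions (line-number attribution changed around PEP 626), which would explain why the same source can show 1 or 2 hits.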

Wrong time calculated, multiline

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   194         2          1.5      0.8      0.0              output_1, output_2 = self.model(
   195         2          0.7      0.3      0.0                  tensor_in, **parameters
   196                                                       )      

Correct time calculated and hits number, single line

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   194         1      50721.4  50721.4      4.5              output_1, output_2  = self.model(tensor_in, **parameters)
Erotemic (Member) commented Aug 4, 2023

Strange, can you make a MWE that reproduces this with a simple torch model?

Theelx (Collaborator) commented Aug 4, 2023

That is indeed interesting. I was using line_profiler on PyTorch models a few weeks ago and didn't notice this issue, but I could easily have been doing something wrong.

Theelx (Collaborator) commented Aug 4, 2023

Also, in addition to an MWE, can you give us your platform (Windows vs Mac vs Linux), Python version, and line_profiler version? This may be related to #210, which is fixed if you install from this git repo but hasn't been officially released yet. So, maybe running your code with the version of line_profiler in this repo could help.

tmm1 commented Aug 21, 2023

You need to make sure to set CUDA_LAUNCH_BLOCKING=1 to get accurate results; otherwise the CUDA kernels run asynchronously and all of their time accumulates on whichever line happens to trigger a CUDA sync.
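As a sketch of that suggestion: the environment variable has to be in place before CUDA is initialized, so either export it in the shell or set it at the very top of the script, before the first torch import (the torch lines below are assumptions from the report, left commented so the snippet stays torch-free):

```python
import os

# Force synchronous CUDA kernel launches so that a line profiler charges
# GPU time to the line that actually launched the kernel. This must be
# set before CUDA is initialized, i.e. before the first `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # assumption: the profiled model code from the report follows
# output_1, output_2 = self.model(tensor_in, **parameters)
```

Equivalently, the variable can be set when invoking line_profiler's CLI, e.g. `CUDA_LAUNCH_BLOCKING=1 kernprof -l script.py`; an alternative is calling `torch.cuda.synchronize()` around the profiled call, at the cost of editing the code.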
