System Tracing Individual Processes? #385

Allan-Luu · 2024-09-17T03:39:20Z

Hey there,

I'm currently working on an LLM inference system with 16 AMD GPUs split evenly across 2 clusters (8 GPUs each).
In my setup, cluster 1 (C1) runs the LLM inference and offloads layers to its local GPUs. It also offloads layers to cluster 2 (C2) through RPC servers, one running on each C2 GPU.

C2 runs 8 servers (one per GPU) for C1 to communicate with, so those processes are constantly running and waiting for C1 to send data.

Is there a way to trace the GPU performance of both C1 and C2 while I run my LLM inference application? Since it's spread across 2 separate clusters, I'm assuming I'd need to run omnitrace on each cluster for a set period and have it listen for HIP/HSA events?

I think the trace time window example may be what I'm looking for, but I'm not sure whether it can be applied to my applications.
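To make the question concrete, here is a rough sketch of the kind of per-GPU launch I have in mind on C2. It only builds and prints the commands; nothing is executed. Everything in it is an assumption to be checked against the omnitrace docs: I believe `omnitrace-sample`, `OMNITRACE_TRACE_DELAY`, `OMNITRACE_TRACE_DURATION`, and `OMNITRACE_OUTPUT_PATH` are the relevant names, but I have not verified that they behave this way for long-running server processes. The `./rpc-server` binary, its `--port` flag, and the port numbers are hypothetical placeholders for my actual servers.

```python
import shlex


def build_traced_launch(server_cmd, gpu_index, delay_s, duration_s,
                        output_root="omnitrace-output"):
    """Return (env, argv) for launching one RPC server under a tracer.

    The setting names below are assumptions taken from my reading of the
    omnitrace documentation, not something I have confirmed works.
    """
    env = {
        # Pin the server to a single GPU (standard HIP runtime variable).
        "HIP_VISIBLE_DEVICES": str(gpu_index),
        # Assumed: seconds to wait before tracing starts.
        "OMNITRACE_TRACE_DELAY": str(delay_s),
        # Assumed: seconds to keep tracing once it has started.
        "OMNITRACE_TRACE_DURATION": str(duration_s),
        # Assumed: separate output directory per GPU so traces don't collide.
        "OMNITRACE_OUTPUT_PATH": f"{output_root}/gpu{gpu_index}",
    }
    # Assumed wrapper form: omnitrace-sample -- <command> <args...>
    argv = ["omnitrace-sample", "--"] + list(server_cmd)
    return env, argv


if __name__ == "__main__":
    # One hypothetical RPC server per GPU on C2 (8 GPUs).
    for gpu in range(8):
        server_cmd = ["./rpc-server", "--port", str(50052 + gpu)]
        env, argv = build_traced_launch(server_cmd, gpu, delay_s=10, duration_s=30)
        prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
        print(prefix, shlex.join(argv))
```

The idea would be to start the C2 servers this way, kick off inference on C1 under a similar wrapper, and have both trace windows cover the same stretch of the run. Whether the windows can be lined up across two clusters like this is part of what I'm asking.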

I hope this makes sense, let me know if there's anything I can clarify further. Thank you!


Replies: 0 comments
