Commit

update cmds in tutorials
Signed-off-by: Peter Park <[email protected]>
peterjunpark committed Nov 5, 2024
1 parent ff72aa4 commit a7161d6
Showing 6 changed files with 43 additions and 43 deletions.
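
The substitution this commit applies (the `omniperf` command becomes `rocprof-compute`, and the GitHub repository path becomes `ROCm/rocprofiler-compute`) can be sketched as a small shell filter. This is only an illustration, not how the commit was produced: the function name `migrate_cmds` is hypothetical, and the patterns assume the old command is always followed by `profile` or `analyze`, as it is in the changed lines.

```shell
# Hypothetical helper (not part of the repository): rewrite old command
# names in tutorial text. Reads the named files, or stdin if none are
# given, and writes the result to stdout without editing anything in place.
migrate_cmds() {
  sed -E \
    -e 's#github\.com/ROCm/omniperf#github.com/ROCm/rocprofiler-compute#g' \
    -e 's#(^|[^[:alnum:]_/-])omniperf( +(profile|analyze))#\1rocprof-compute\2#g' \
    "$@"
}
```

For example, `migrate_cmds lds-examples.rst` prints the rewritten file to standard output. The second pattern only rewrites `omniperf` when a subcommand follows it, so other text is left as-is.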
28 changes: 14 additions & 14 deletions docs/tutorial/includes/infinity-fabric-transactions.rst
@@ -64,7 +64,7 @@ In our first experiment, we consider the simplest possible case, a

.. code-block:: shell-session
- $ omniperf profile -n coarse_grained_local --no-roof -- ./fabric -t 1 -o 0
+ $ rocprof-compute profile -n coarse_grained_local --no-roof -- ./fabric -t 1 -o 0
Using:
mtype:CoarseGrained
mowner:Device
@@ -73,7 +73,7 @@ In our first experiment, we consider the simplest possible case, a
mdata:Unsigned
remoteId:-1
<...>
- $ omniperf analyze -p workloads/coarse_grained_local/mi200 -b 17.2.0 17.2.1 17.2.2 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
+ $ rocprof-compute analyze -p workloads/coarse_grained_local/mi200 -b 17.2.0 17.2.1 17.2.2 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
@@ -163,7 +163,7 @@ accelerator. Our code uses the ``hipExtMallocWithFlag`` API with the

.. code-block:: shell-session
- $ omniperf profile -n fine_grained_local --no-roof -- ./fabric -t 0 -o 0
+ $ rocprof-compute profile -n fine_grained_local --no-roof -- ./fabric -t 0 -o 0
Using:
mtype:FineGrained
mowner:Device
@@ -172,7 +172,7 @@ accelerator. Our code uses the ``hipExtMallocWithFlag`` API with the
mdata:Unsigned
remoteId:-1
<...>
- $ omniperf analyze -p workloads/fine_grained_local/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
+ $ rocprof-compute analyze -p workloads/fine_grained_local/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
@@ -245,7 +245,7 @@ substantial change in the L2-Fabric metrics:

.. code-block:: shell-session
- $ omniperf profile -n fine_grained_remote --no-roof -- ./fabric -t 0 -o 2
+ $ rocprof-compute profile -n fine_grained_remote --no-roof -- ./fabric -t 0 -o 2
Using:
mtype:FineGrained
mowner:Remote
@@ -254,7 +254,7 @@ substantial change in the L2-Fabric metrics:
mdata:Unsigned
remoteId:-1
<...>
- $ omniperf analyze -p workloads/fine_grained_remote/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
+ $ rocprof-compute analyze -p workloads/fine_grained_remote/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
@@ -339,7 +339,7 @@ fine-grained memory using the ``hipHostMalloc`` API:

.. code-block:: shell-session
- $ omniperf profile -n fine_grained_host --no-roof -- ./fabric -t 0 -o 1
+ $ rocprof-compute profile -n fine_grained_host --no-roof -- ./fabric -t 0 -o 1
Using:
mtype:FineGrained
mowner:Host
@@ -348,7 +348,7 @@ fine-grained memory using the ``hipHostMalloc`` API:
mdata:Unsigned
remoteId:-1
<...>
- $ omniperf analyze -p workloads/fine_grained_host/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
+ $ rocprof-compute analyze -p workloads/fine_grained_host/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
@@ -416,7 +416,7 @@ allocation as coarse-grained:

.. code-block:: shell-session
- $ omniperf profile -n coarse_grained_host --no-roof -- ./fabric -t 1 -o 1
+ $ rocprof-compute profile -n coarse_grained_host --no-roof -- ./fabric -t 1 -o 1
Using:
mtype:CoarseGrained
mowner:Host
@@ -425,7 +425,7 @@ allocation as coarse-grained:
mdata:Unsigned
remoteId:-1
<...>
- $ omniperf analyze -p workloads/coarse_grained_host/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
+ $ rocprof-compute analyze -p workloads/coarse_grained_host/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
@@ -484,7 +484,7 @@ operations to fine-grained memory allocated on the host:

.. code-block:: shell-session
- $ omniperf profile -n fine_grained_host_write --no-roof -- ./fabric -t 0 -o 1 -p 1
+ $ rocprof-compute profile -n fine_grained_host_write --no-roof -- ./fabric -t 0 -o 1 -p 1
Using:
mtype:FineGrained
mowner:Host
@@ -493,7 +493,7 @@ operations to fine-grained memory allocated on the host:
mdata:Unsigned
remoteId:-1
<...>
- $ omniperf analyze -p workloads/fine_grained_host_writes/mi200 -b 17.2.4 17.2.5 17.2.6 17.2.7 17.2.8 17.4.3 17.4.4 17.4.5 17.4.6 17.5.5 17.5.6 17.5.7 17.5.8 17.5.9 17.5.10 -n per_kernel --dispatch 2
+ $ rocprof-compute analyze -p workloads/fine_grained_host_writes/mi200 -b 17.2.4 17.2.5 17.2.6 17.2.7 17.2.8 17.4.3 17.4.4 17.4.5 17.4.6 17.5.5 17.5.6 17.5.7 17.5.8 17.5.9 17.5.10 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
@@ -576,7 +576,7 @@ operations to the CPU’s DRAM.

.. code-block:: shell-session
- $ omniperf profile -n fine_grained_host_add --no-roof -- ./fabric -t 0 -o 1 -p 2
+ $ rocprof-compute profile -n fine_grained_host_add --no-roof -- ./fabric -t 0 -o 1 -p 2
Using:
mtype:FineGrained
mowner:Host
@@ -585,7 +585,7 @@ operations to the CPU’s DRAM.
mdata:Unsigned
remoteId:-1
<...>
- $ omniperf analyze -p workloads/fine_grained_host_add/mi200 -b 17.2.4 17.2.5 17.2.6 17.2.7 17.2.8 17.4.3 17.4.4 17.4.5 17.4.6 17.5.5 17.5.6 17.5.7 17.5.8 17.5.9 17.5.10 -n per_kernel --dispatch 2
+ $ rocprof-compute analyze -p workloads/fine_grained_host_add/mi200 -b 17.2.4 17.2.5 17.2.6 17.2.7 17.2.8 17.4.3 17.4.4 17.4.5 17.4.6 17.5.5 17.5.6 17.5.7 17.5.8 17.5.9 17.5.10 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
Original file line number Diff line number Diff line change
@@ -17,7 +17,7 @@ and was run on an MI250 CDNA2 accelerator:

.. code-block:: shell
- $ omniperf profile -n ipc --no-roof -- ./ipc
+ $ rocprof-compute profile -n ipc --no-roof -- ./ipc
The results shown in this section are *generally* applicable to CDNA
accelerators, but may vary between generations and specific products.
@@ -68,7 +68,7 @@ with ROCm Compute Profiler, we see:

.. code-block:: shell-session
- $ omniperf analyze -p workloads/ipc/mi200/ --dispatch 7 -b 11.2
+ $ rocprof-compute analyze -p workloads/ipc/mi200/ --dispatch 7 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
@@ -172,7 +172,7 @@ in the IPC example:

.. code-block:: shell
- $ omniperf analyze -p workloads/ipc/mi200/ --dispatch 8 -b 11.2 --decimal 4
+ $ rocprof-compute analyze -p workloads/ipc/mi200/ --dispatch 8 -b 11.2 --decimal 4
<...>
--------------------------------------------------------------------------------
0. Top Stat
@@ -265,7 +265,7 @@ Running this kernel through ROCm Compute Profiler yields:

.. code-block:: shell-session
- $ omniperf analyze -p workloads/ipc/mi200/ --dispatch 9 -b 11.2
+ $ rocprof-compute analyze -p workloads/ipc/mi200/ --dispatch 9 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
@@ -366,7 +366,7 @@ scalar register (``s0``). Running this kernel through ROCm Compute Profiler yiel

.. code-block:: shell-session
- $ omniperf analyze -p workloads/ipc/mi200/ --dispatch 10 -b 11.2
+ $ rocprof-compute analyze -p workloads/ipc/mi200/ --dispatch 10 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
@@ -430,7 +430,7 @@ through ROCm Compute Profiler yields:

.. code-block:: shell-session
- $ omniperf analyze -p workloads/ipc/mi200/ --dispatch 11 -b 11.2
+ $ rocprof-compute analyze -p workloads/ipc/mi200/ --dispatch 11 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
10 changes: 5 additions & 5 deletions docs/tutorial/includes/lds-examples.rst
@@ -15,11 +15,11 @@ v5.6.0, and ROCm Compute Profiler v2.0.0.
$ hipcc -O3 lds.hip -o lds
- Finally, we generate our ``omniperf profile`` as:
+ Finally, we generate our ``rocprof-compute profile`` as:

.. code-block:: shell-session
- $ omniperf profile -n lds --no-roof -- ./lds
+ $ rocprof-compute profile -n lds --no-roof -- ./lds
.. _lds-bandwidth:

@@ -71,7 +71,7 @@ Next, let’s analyze the first of our bandwidth kernel dispatches:

.. code-block:: shell
- $ omniperf analyze -p workloads/lds/mi200/ -b 12.2.1 --dispatch 0 -n per_kernel
+ $ rocprof-compute analyze -p workloads/lds/mi200/ -b 12.2.1 --dispatch 0 -n per_kernel
<...>
12. Local Data Share (LDS)
12.2 LDS Stats
@@ -172,7 +172,7 @@ see:

.. code-block:: shell
- $ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 256 -n per_kernel
+ $ rocprof-compute analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 256 -n per_kernel
<...>
--------------------------------------------------------------------------------
12. Local Data Share (LDS)
@@ -196,7 +196,7 @@ Looking at the next ``conflicts`` dispatch (i.e., two work-items) yields:

.. code-block:: shell
- $ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 257 -n per_kernel
+ $ rocprof-compute analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 257 -n per_kernel
<...>
--------------------------------------------------------------------------------
12. Local Data Share (LDS)
8 changes: 4 additions & 4 deletions docs/tutorial/includes/occupancy-limiters-example.rst
@@ -25,7 +25,7 @@ Finally, we generate our ROCm Compute Profiler profile as:

.. code-block:: shell
- $ omniperf profile -n occupancy --no-roof -- ./occupancy
+ $ rocprof-compute profile -n occupancy --no-roof -- ./occupancy
.. _occupancy-experiment-design:

@@ -101,7 +101,7 @@ the analyze step on this kernel:

.. code-block:: shell
- $ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 --dispatch 1
+ $ rocprof-compute analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 --dispatch 1
<...>
--------------------------------------------------------------------------------
0. Top Stat
@@ -226,7 +226,7 @@ Analyzing this:

.. code-block:: shell
- $ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 --dispatch 3
+ $ rocprof-compute analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 --dispatch 3
<...>
--------------------------------------------------------------------------------
2. System Speed-of-Light
@@ -351,7 +351,7 @@ Analyzing this workload yields:

.. code-block:: shell-session
- $ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 7.1.9 --dispatch 5
+ $ rocprof-compute analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 7.1.9 --dispatch 5
<...>
--------------------------------------------------------------------------------
0. Top Stat
4 changes: 2 additions & 2 deletions docs/tutorial/includes/valu-arithmetic-instruction-mix.rst
@@ -65,13 +65,13 @@ Generate the profile for this example using the following command.

.. code-block:: shell
- $ omniperf profile -n instmix --no-roof -- ./instmix
+ $ rocprof-compute profile -n instmix --no-roof -- ./instmix
Analyze the instruction mix section.

.. code-block:: shell
- $ omniperf analyze -p workloads/instmix/mi200/ -b 10.2
+ $ rocprof-compute analyze -p workloads/instmix/mi200/ -b 10.2
<...>
10. Compute Units - Instruction Mix
10.2 VALU Arithmetic Instr Mix
24 changes: 12 additions & 12 deletions docs/tutorial/includes/vector-memory-operation-counting.rst
@@ -34,11 +34,11 @@ We have also chosen to include the ``--save-temps`` flag to save the
compiler temporary files, such as the generated CDNA assembly code, for
inspection.

- Finally, we generate our ``omniperf profile`` as follows.
+ Finally, we generate our ``rocprof-compute profile`` as follows.

.. code-block:: shell-session
- $ omniperf profile -n vmem --no-roof -- ./vmem
+ $ rocprof-compute profile -n vmem --no-roof -- ./vmem
.. _flat-experiment-design:

@@ -94,7 +94,7 @@ First, we demonstrate our simple ``global_write`` kernel:

.. code-block:: shell-session
- $ omniperf analyze -p workloads/vmem/mi200/ --dispatch 1 -b 10.3 15.1.4 15.1.5 15.1.6 15.1.7 15.1.8 15.1.9 15.1.10 15.1.11 -n per_kernel
+ $ rocprof-compute analyze -p workloads/vmem/mi200/ --dispatch 1 -b 10.3 15.1.4 15.1.5 15.1.6 15.1.7 15.1.8 15.1.9 15.1.10 15.1.11 -n per_kernel
<...>
--------------------------------------------------------------------------------
0. Top Stat
@@ -208,7 +208,7 @@ Examining this kernel in the VMEM Instruction Mix table yields:

.. code-block:: shell-session
- $ omniperf analyze -p workloads/vmem/mi200/ --dispatch 2 -b 10.3 -n per_kernel
+ $ rocprof-compute analyze -p workloads/vmem/mi200/ --dispatch 2 -b 10.3 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
@@ -264,7 +264,7 @@ access.

.. code-block:: shell-session
- $ omniperf analyze -p workloads/vmem/mi200/ --dispatch 2 -b 12.2.0 -n per_kernel
+ $ rocprof-compute analyze -p workloads/vmem/mi200/ --dispatch 2 -b 12.2.0 -n per_kernel
<...>
12. Local Data Share (LDS)
12.2 LDS Stats
@@ -308,7 +308,7 @@ Running ROCm Compute Profiler on this kernel yields:

.. code-block:: shell-session
- $ omniperf analyze -p workloads/vmem/mi200/ --dispatch 3 -b 10.3 -n per_kernel
+ $ rocprof-compute analyze -p workloads/vmem/mi200/ --dispatch 3 -b 10.3 -n per_kernel
<...>
0. Top Stat
╒════╤════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
@@ -387,7 +387,7 @@ Running ROCm Compute Profiler on this kernel reports:

.. code-block:: shell-session
- $ omniperf analyze -p workloads/vmem/mi200/ --dispatch 4 -b 10.3 12.2.0 16.3.10 -n per_kernel
+ $ rocprof-compute analyze -p workloads/vmem/mi200/ --dispatch 4 -b 10.3 12.2.0 16.3.10 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
@@ -472,7 +472,7 @@ Running ROCm Compute Profiler on this kernel yields:

.. code-block:: shell-session
- $ omniperf analyze -p workloads/vmem/mi200/ --dispatch 5 -b 10.3 16.3.12 -n per_kernel
+ $ rocprof-compute analyze -p workloads/vmem/mi200/ --dispatch 5 -b 10.3 16.3.12 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
@@ -541,7 +541,7 @@ Running this kernel through ROCm Compute Profiler shows:

.. code-block:: shell-session
- $ omniperf analyze -p workloads/vmem/mi200/ --dispatch 6 -b 10.3 12.2.0 16.3.12 -n per_kernel
+ $ rocprof-compute analyze -p workloads/vmem/mi200/ --dispatch 6 -b 10.3 12.2.0 16.3.12 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
@@ -623,7 +623,7 @@ manner. See
for further reading on this instruction type.

We develop a `simple
- kernel <https://github.com/ROCm/omniperf/blob/amd-mainline/sample/stack.hip>`__
+ kernel <https://github.com/ROCm/rocprofiler-compute/blob/amd-mainline/sample/stack.hip>`__
that uses stack memory:

.. code-block:: cpp
@@ -657,9 +657,9 @@ And profiled using ROCm Compute Profiler:

.. code-block:: shell-session
- $ omniperf profile -n stack --no-roof -- ./stack
+ $ rocprof-compute profile -n stack --no-roof -- ./stack
<...>
- $ omniperf analyze -p workloads/stack/mi200/ -b 10.3 16.3.11 -n per_kernel
+ $ rocprof-compute analyze -p workloads/stack/mi200/ -b 10.3 16.3.11 -n per_kernel
<...>
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix