-
Notifications
You must be signed in to change notification settings - Fork 166
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Reorganize PTX docs to match generator (#2929)
Co-authored-by: Allard Hendriksen <[email protected]>
- Loading branch information
1 parent
cee542b
commit ee46f3e
Showing
54 changed files
with
1,616 additions
and
1,583 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
.. _libcudacxx-ptx-instructions-barrier-cluster: | ||
|
||
barrier.cluster | ||
=============== | ||
|
||
- PTX ISA: | ||
`barrier.cluster <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-barrier-cluster>`__ | ||
|
||
Similar functionality is provided through the builtins | ||
``__cluster_barrier_arrive(), __cluster_barrier_arrive_relaxed(), __cluster_barrier_wait()``, | ||
as well as the ``cooperative_groups::cluster_group`` | ||
`API <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cluster-group>`__. | ||
|
||
The ``.aligned`` variants of the instructions are not exposed. | ||
|
||
.. include:: generated/barrier_cluster.rst |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
.. _libcudacxx-ptx-instructions-cp-async-bulk: | ||
|
||
cp.async.bulk | ||
============= | ||
|
||
- PTX ISA: | ||
`cp.async.bulk <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`__ | ||
|
||
Implementation notes | ||
-------------------- | ||
|
||
**NOTE.** Both ``srcMem`` and ``dstMem`` must be 16-byte aligned, and | ||
``size`` must be a multiple of 16. | ||
|
||
Changelog | ||
--------- | ||
|
||
- In earlier versions, ``cp_async_bulk_multicast`` was enabled for | ||
SM_90. This has been changed to SM_90a. | ||
|
||
|
||
Unicast | ||
------- | ||
|
||
.. include:: generated/cp_async_bulk.rst | ||
|
||
Multicast | ||
--------- | ||
|
||
.. include:: generated/cp_async_bulk_multicast.rst |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
.. _libcudacxx-ptx-instructions-cp-async-bulk-tensor: | ||
|
||
cp.async.bulk.tensor | ||
==================== | ||
|
||
- PTX ISA: | ||
`cp.async.bulk.tensor <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`__ | ||
|
||
Changelog | ||
--------- | ||
|
||
- In earlier versions, ``cp_async_bulk_tensor_multicast`` was enabled | ||
for SM_90. This has been changed to SM_90a. | ||
|
||
Unicast | ||
------- | ||
|
||
.. include:: generated/cp_async_bulk_tensor.rst | ||
|
||
Multicast | ||
--------- | ||
|
||
.. include:: generated/cp_async_bulk_tensor_multicast.rst |
9 changes: 9 additions & 0 deletions
9
docs/libcudacxx/ptx/instructions/cp_async_bulk_wait_group.rst
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
.. _libcudacxx-ptx-instructions-cp-async-bulk-wait_group: | ||
|
||
cp.async.bulk.wait_group | ||
======================== | ||
|
||
- PTX ISA: | ||
`cp.async.bulk.wait_group <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-wait-group>`__ | ||
|
||
.. include:: generated/cp_async_bulk_wait_group.rst |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
.. _libcudacxx-ptx-instructions-cp-reduce-async-bulk: | ||
|
||
cp.reduce.async.bulk | ||
==================== | ||
|
||
- PTX ISA: | ||
`cp.reduce.async.bulk <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk>`__ | ||
|
||
|
||
Integer and floating point instructions | ||
--------------------------------------- | ||
|
||
.. include:: generated/cp_reduce_async_bulk.rst | ||
|
||
Emulation of ``.s64`` instruction | ||
--------------------------------- | ||
|
||
PTX does not currently (CTK 12.3) expose | ||
``cp.reduce.async.bulk.add.s64``. This exposure is emulated in | ||
``cuda::ptx`` using: | ||
|
||
.. code:: cuda | ||
// cp.reduce.async.bulk.dst.src.mbarrier::complete_tx::bytes.op.u64 [dstMem], [srcMem], size, [rdsmem_bar]; // 2. PTX ISA 80, SM_90 | ||
// .dst = { .shared::cluster } | ||
// .src = { .shared::cta } | ||
// .type = { .s64 } | ||
// .op = { .add } | ||
template <typename=void> | ||
__device__ static inline void cp_reduce_async_bulk( | ||
cuda::ptx::space_cluster_t, | ||
cuda::ptx::space_shared_t, | ||
cuda::ptx::op_add_t, | ||
int64_t* dstMem, | ||
const int64_t* srcMem, | ||
uint32_t size, | ||
uint64_t* rdsmem_bar); | ||
// cp.reduce.async.bulk.dst.src.bulk_group.op.u64 [dstMem], [srcMem], size; // 6. PTX ISA 80, SM_90 | ||
// .dst = { .global } | ||
// .src = { .shared::cta } | ||
// .type = { .s64 } | ||
// .op = { .add } | ||
template <typename=void> | ||
__device__ static inline void cp_reduce_async_bulk( | ||
cuda::ptx::space_global_t, | ||
cuda::ptx::space_shared_t, | ||
cuda::ptx::op_add_t, | ||
int64_t* dstMem, | ||
const int64_t* srcMem, | ||
uint32_t size); | ||
FP16 instructions | ||
----------------- | ||
|
||
.. include:: generated/cp_reduce_async_bulk_f16.rst | ||
|
||
BF16 instructions | ||
----------------- | ||
|
||
.. include:: generated/cp_reduce_async_bulk_bf16.rst |
9 changes: 9 additions & 0 deletions
9
docs/libcudacxx/ptx/instructions/cp_reduce_async_bulk_tensor.rst
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
.. _libcudacxx-ptx-instructions-cp-reduce-async-bulk-tensor: | ||
|
||
cp.reduce.async.bulk.tensor | ||
=========================== | ||
|
||
- PTX ISA: | ||
`cp.reduce.async.bulk.tensor <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`__ | ||
|
||
.. include:: generated/cp_reduce_async_bulk_tensor.rst |
Oops, something went wrong.