
[Lang] Revisit memory model #321

Open · wants to merge 9 commits into base: main
83 changes: 72 additions & 11 deletions specs/language/introduction.tex
@@ -273,21 +273,82 @@

\Sec{\acrshort{hlsl} Memory Models}{Intro.Memory}

\p Memory accesses for \gls{sm} 5.0 and earlier operate on 128-bit slots aligned
on 128-bit boundaries. This was optimized for the common case in early shaders,
where the data being processed on the GPU was usually a 4-element vector of
32-bit values.

\p On modern hardware these restrictions are loosened: reads of 32-bit
multiples are supported starting with \gls{sm} 5.1, and reads of 16-bit
multiples starting with \gls{sm} 6.0. \gls{sm} features are fully documented in
the \gls{dx} Specifications, and this document will not attempt to elaborate
further.
\p The fundamental storage unit in HLSL is a \textit{byte}, which comprises 8
\textit{bits}. Each \textit{bit} stores a single value, 0 or 1. Each byte has a
unique \textit{address}.

\p A \textit{memory location} is a range of bytes which can be identified by an
address and a length. A memory location represents either a scalar object, or a
sequence of adjacent bit-fields of non-zero size.

\p Each read or write to a memory location is called a \textit{memory access}.
Operations that perform memory accesses are called \textit{memory operations}. A
memory operation may operate on one or more memory locations. A memory operation
must not alter memory at a location not contained in the set of memory locations
it is operating on.
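As a hypothetical illustration (the structure \texttt{TwoFields} and its layout
are assumptions, not normative), an assignment to one member is a memory
operation on that member's memory locations only:

\begin{HLSL}
struct TwoFields {
  uint A; // one memory location: 4 bytes
  uint B; // a separate memory location: 4 bytes
};

void Example(inout TwoFields Val) {
  Val.A = 1; // a memory operation on the locations of A; it
             // must not alter memory at the locations of B
}
\end{HLSL}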

\begin{note}
The memory location of a bit-field may include adjacent bit-fields. For
example given the following declaration:

\begin{HLSL}
struct ContainingBitfields {
  uint A : 4;
  uint B : 4;
  uint : 0;
  uint D : 4;
};
\end{HLSL}

Members \texttt{A} and \texttt{B} share the same memory location, comprising
the 4 bytes at the start of the structure. The zero-sized anonymous bit-field
member causes a break in bit-field packing, so member \texttt{D} occupies the
next memory location, beginning at the 5th byte of the structure and continuing
for 4 bytes. For a fuller description of bit-fields see \ref{Classes.BitFields}.
\end{note}

\p Padding bytes inside a structure are included in the memory location of the
structure, but are not included in the memory locations of the members inside
the structure. This means that element-wise operations like default copy
operations do not copy padding bytes. Because structure padding is
implementation defined, and reading or writing padding bytes is undefined
behavior, an implementation may generate writes that overwrite padding bytes.
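For example, assuming a hypothetical implementation that aligns \texttt{uint}
members to 4-byte boundaries (the padding shown is an assumption, since padding
is implementation defined):

\begin{HLSL}
struct HasPadding {
  uint16_t A; // 2 bytes: the memory location of A
              // 2 padding bytes: part of the structure's memory
              // location, but not of any member's
  uint B;     // 4 bytes: the memory location of B
};
\end{HLSL}

Under this assumed layout, a default copy of \texttt{HasPadding} copies the
memory locations of \texttt{A} and \texttt{B} but need not preserve the two
padding bytes.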

\p Reading from uninitialized memory is undefined. Writing uninitialized values
to memory is undefined.

\p Two sets of memory locations, \texttt{A} and \texttt{B}, are said to
\textit{overlap} each other if some memory location in \texttt{A} is also in
\texttt{B} (\(A \cap B \neq \emptyset\)).
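As a sketch (the names are illustrative), the set of memory locations of an
object overlaps the set of memory locations of each of its members:

\begin{HLSL}
struct Pair {
  uint A;
  uint B;
};

Pair P;
// The memory locations of P overlap the memory locations of
// P.A: the 4 bytes storing P.A belong to both sets.
\end{HLSL}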

\Sub{Memory Spaces}{Intro.Memory.Spaces}

\p \acrshort{hlsl} programs manipulate data stored in four distinct memory
spaces: thread, threadgroup, device and constant.
spaces: thread, threadgroup, device and constant. Memory spaces are logical
abstractions over physical memory. Each memory space has a defined \textit{line
width}, which specifies the minimum readable and writable size, and a
\textit{minimum alignment}, which defines the smallest addressable increment of
the memory space. The two values need not be equal.

\begin{note}
\p Memory accesses for many resource types in \gls{dx} operate on 128-bit
slots aligned on 128-bit boundaries. In the terms of this specification it
would be said that those memory spaces have a 128-bit \textit{line width},
and a 128-bit \textit{minimum alignment}.
\end{note}
Comment on lines +329 to +339 (Collaborator):
There are a few constraints around memory accesses in HLSL and DXIL that you're trying to abstract over here, but I'm not sure the "line width" idea captures them effectively. In some sense it might seem nice to boil down some similar rules into a simple concept, but it's worth noting why the rules are what they are and how they might change.

  1. "Legacy" cbuffer and tbuffer layout. That is, the only cbuffer layout. Here, we have a constraint that came from 16-byte DXBC registers. The cbufferLoadLegacy docs call this a "row" in a comment, but I don't know that there's ever been any official terminology. Here, the rules on how big a single object or element of an array can be (128 bits) come from the packing rules, and it would arguably be better to just write a section on those rules akin to the notes in maraneshi's layout visualizer rather than try to discuss this as a general rule about access size.
  2. Data access via TypedBuffer. This is presumably where the "line width" idea comes from, but a lot of its complexity is unnecessary if we disallow "types that happen to fit" as type arguments to Buffer<>. Here, we have accesses to typed buffers and textures, and the operation that accesses them operates on a 4-element contained type. A Buffer<float> is really a Buffer<float4> that we only use one element from.

     This gets a bit confusing for 64-bit types. Notably, Buffer<double4> is not valid HLSL. However, this is really an implementation detail leaking through, since the storage actually splits doubles up into int32 parts. So it's probably better to just think of Buffer<double2> as syntactic sugar for the casts and just call this kind of memory access what it is: access into a container of 4 at-most-32-bit values.

  3. Vectors of more than 4 elements don't exist in HLSL. This is simply due to the fact that there's a fixed set of vector types and no way for a user to create their own. It isn't a meaningful rule, and in spaces like local device memory we really don't need any constraints on the language here. If it becomes possible to write a double8 somehow in the future and that isn't used in constant or typed buffers specifically, it's straightforward for implementations to do whatever they need to do to lower it. I don't think we want to define an artificial limit there.

So I guess TLDR I think we should simply say two separate things rather than trying to define "line width":

  • Objects in Constant Buffer Memory are laid out according to the constant buffer packing rules (to be defined later). Elements of structures and arrays in this layout cannot exceed 128 bits in size.
  • Memory accesses into typed buffers are defined to access 4 elements of at-most-32-bit values. (Also possibly a note about emulating 64-bit values, though this may not belong in this section.)

Also note that I use "constant buffer" memory in my wording above, rather than "constant memory". We may want to keep that terminology available for if we ever do something in that space that doesn't carry the constant buffer legacy.
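The reviewer's TypedBuffer point could be sketched as follows (an illustrative
assumption about the conceptual model, not proposed spec wording):

\begin{HLSL}
Buffer<float> Buf;

float Load(uint I) {
  // Conceptually this access loads a 4-element slot (as if the
  // resource were Buffer<float4>) and extracts one component.
  return Buf[I];
}
\end{HLSL}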

Comment (Collaborator):
I agree.

I think part of the problem is trying to view all resource accesses as if they are like native memory accesses from the shader, with only the address space placing constraints on alignment and such. I think we can evolve constant memory and raw/structured buffer memory in this direction, but not typed/texture accesses. For memory that goes in this direction, I don't think "line width" would be a concept we want to use/keep, and "minimum alignment" will be defined in other ways, rather than by some fixed value applied to a memory type.

Some notes:

> Elements of structures and arrays in this layout cannot exceed 128 bits in size.

I don't think I would agree with that. First, it's a confusing use of the term "element" here. Perhaps you had a different definition of "element" in mind than what I am interpreting here, but I struggle to think of a single definition that fits this statement. Plenty of elements of structures and arrays in HLSL that exceed 128 bits in size can be placed into a constant buffer. You can declare a double4 (or an array of them) in a constant buffer, which will use two rows for the vector. It's just that structures, array elements, and any type that cannot fit within the remainder of a row will be started at the beginning of the next available row. For some of these, that's part of the high-level packing rules, not necessarily something intrinsic to the DXIL interface. For array elements, they must be 128-bit aligned to ensure that array indexing maps to an index in the DXIL legacy constant buffer load op without impacting the index of the component read from the result.

For legacy constant buffer load in DXIL, it's important to note that this load op doesn't mean all of the components are loaded - only the components that are extracted from the result structure need to be loaded. It's a subtle difference, but important in certain circumstances, and mismatches the concept of "line width" as applied to constant buffers. Think of the DXIL op as a compromise as there wasn't an easy way to express the thing that's expressed easily in DXBC asm like so: CB0[0][0].yyyz (only loads y and z components).


\p Each address has an associated memory space. Two addresses with the same
value but different memory spaces are distinct addresses.

\p A memory location in any memory space may overlap with another memory
location in the same space. A memory location in thread or threadgroup memory
may not overlap with memory locations in any other memory spaces\footnote{The
physical memory regions for thread and threadgroup memory are required to be
distinct and non-overlapping with any other memory space.}. It is implementation
defined if memory locations in other memory spaces overlap with memory locations
in different spaces\footnote{An implementation may define device, constant or
additional extended memory spaces to share logical address ranges.}.
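As an illustrative, non-normative sketch (the declarations are assumptions for
the example):

\begin{HLSL}
groupshared uint G; // threadgroup memory
static uint T;      // thread memory

// Even if G and T happened to have the same numeric address
// value, they are distinct addresses because they reside in
// different memory spaces.
\end{HLSL}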

\SubSub{Thread Memory}{Intro.Memory.Spaces.Thread}

1 change: 1 addition & 0 deletions specs/language/placeholders.tex
@@ -7,6 +7,7 @@
\Ch{Classes}{Classes}
\Sec{Static Members}{Classes.Static}
\Sec{Conversions}{Classes.Conversions}
\Sec{Bit-fields}{Classes.BitFields}
\Ch{Templates}{Template}
\Sec{Template Instantiation}{Template.Inst}
\Sec{Partial Ordering of Function Templates}{Template.Func.Order}