Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sycl][cuda][hip] Expose const addrsp via device_global<T, decltype(properties{device_constant})> #16001

Draft
wants to merge 38 commits into
base: sycl
Choose a base branch
from

Conversation

JackAKirk
Copy link
Contributor

@JackAKirk JackAKirk commented Nov 6, 2024

Map cuda/hip const addrspace device global variable (__constant__) to device_global<T, decltype(properties{device_constant})>

Nvidia GPUs have a dedicated constant memory cache which can be a lot faster in some cases for constant global device variables ("constant cuda symbols"). CUDA programmers access this cache via global variables marked __constant__
AMD GPUs do not have a dedicated constant memory cache (as far as I am aware). However the HIP programming model does support __constant__. As well as supporting the constant cache in the Nvidia case, when AMD GPUs are the target the macro can be used as a compiler hint for other optimizations such as using SGPRs (scalar registers) instead of VGPRs (vector registers).

This patch switches on these optimizations for cuda/hip backends of dpc++.

This is a natural translation that allows the complete support of device_global features under the constraint that programmers cannot update the device_global<T, decltype(properties{device_constant})> in kernel code (matching __constant__ semantics in cuda/hip), whilst still allowing them to update this constant global variable via queue::memcpy(const device_global), which maps naturally to how CUDA APIs allows programmers to update __constant__ device symbols via the host.

Key applications that have been identified will benefit from this:

Kokkos (general)
Blender
NWCHEMEX aka Exachem

Fixes #5827
Fixes #4278

Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
template <typename propertyT> static constexpr auto get_property() { \
return property_list_t::template get_property<propertyT>(); \
}

template <typename T, typename... Props>
class
#ifdef __SYCL_DEVICE_ONLY__
[[__sycl_detail__::global_variable_allowed, __sycl_detail__::device_global,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if device_global should take an optional argument to select the address space instead of a new device_constant. If we do, you could use some meta-programming to select the address space here, that could limit the impact on the headers.

Copy link
Contributor

@steffenlarsen steffenlarsen Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could even do that with the current property. It should be as simple as doing a std::conditional_t<property_list_t::template has_property<device_constant_key>(), __OPENCL_CONSTANT_AS__ T *, T *>, inheriting the definition of __OPENCL_CONSTANT_AS__ from sycl/include/sycl/access/access.hpp. Assuming the PTX and AMDGCN know how to handle __attribute__((opencl_constant)), that should hopefully avoid the need for the new clang attribute.

As for the member availability, this could be done through either conditionally picking base classes or SFINAE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could even do that with the current property. It should be as simple as doing a std::conditional_t<property_list_t::template has_property<device_constant_key>(), __OPENCL_CONSTANT_AS__ T *, T *>, inheriting the definition of __OPENCL_CONSTANT_AS__ from sycl/include/sycl/access/access.hpp. Assuming the PTX and AMDGCN know how to handle __attribute__((opencl_constant)), that should hopefully avoid the need for the new clang attribute.

As for the member availability, this could be done through either conditionally picking base classes or SFINAE.

Doing address space declarations directly in source code is currently not allowed by SEMA: I get e.g. (same if field is pointer type):

error: field may not be qualified with an address space
  100 |   T __attribute__((opencl_constant)) val{};

I looked into changing this behaviour, but I didn't think there was a simple solution.
My current idea is to partially specialize as

    device_global<
        T, detail::properties_t<Props...>,
        typename std::enable_if_t<(detail::properties_t<Props...>::
                               template has_property<device_constant_key>())>>
    : public detail::device_global_base<T, detail::properties_t<Props...>>

such that when the property device_constant is used we add a clang attribute to the class :

__sycl_detail__::device_constant

That the compiler then uses to manually set the address space to .const only for cuda/hip backends.

This I think in theory should be compatible with the partial specializations of the device_global_base class; to allow simultaneous specializations for the case when device_image_scope property is in the property list (via device_global_base): The only property that can be used in combination with the device_constant property should be device_image_scope I think (I will update the specification doc once the implementation is finalised).

@steffenlarsen @Naghasan Maybe there is a better solution though?

Once this is done there should also probably be a few more tests added to check all valid combined functionality of the device_constant property with the device_image_scope property. There should also be a test checking that the compiler does not allow writing to a device_global with the device_constant property (apart from via sycl::hander::/queue::memcpy etc).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need the attribute, could we maybe make it apply to the field instead of the device-global class then? I.e. that way we can use the conditional solution we previously discussed, but with the new attribute.

As a thought experiment, can we think of a case where a const global variable (including const fields of non-const global variables) would not want the variables to be in the .const namespace for NVPTX and AMDGCN? If not, could we maybe make the address-space decision based on that?

Copy link
Contributor Author

@JackAKirk JackAKirk Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need the attribute, could we maybe make it apply to the field instead of the device-global class then? I.e. that way we can use the conditional solution we previously discussed, but with the new attribute.

Yeah this sounds like it might be a better solution. I'll look into doing this. Thanks for the input

As a thought experiment, can we think of a case where a const global variable (including const fields of non-const global variables) would not want the variables to be in the .const namespace for NVPTX and AMDGCN? If not, could we maybe make the address-space decision based on that?

I am not sure if it can be an issue for NVPTX and AMDGCN backends, but in theory you can run out of .const memory space (It is normally limited to 4kb for NVPTX), and hence a user may want to be careful about which variables to put in .const space. If it is not a real issue in these backends, it could be an issue in other as yet unsupported backends. This is why we do not wan to have device_global<const T> imply .const address space, and instead have the explicit device_global<T, decltype(properties{device_constant})> solution, which will be functionally identical to device_global<const T, decltype(properties{device_constant})>:

device_constant implies const T but const T shouldn't imply .const address space.

template <typename propertyT> static constexpr auto get_property() { \
return property_list_t::template get_property<propertyT>(); \
}

template <typename T, typename... Props>
class
#ifdef __SYCL_DEVICE_ONLY__
[[__sycl_detail__::global_variable_allowed, __sycl_detail__::device_global,
Copy link
Contributor

@steffenlarsen steffenlarsen Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could even do that with the current property. It should be as simple as doing a std::conditional_t<property_list_t::template has_property<device_constant_key>(), __OPENCL_CONSTANT_AS__ T *, T *>, inheriting the definition of __OPENCL_CONSTANT_AS__ from sycl/include/sycl/access/access.hpp. Assuming the PTX and AMDGCN know how to handle __attribute__((opencl_constant)), that should hopefully avoid the need for the new clang attribute.

As for the member availability, this could be done through either conditionally picking base classes or SFINAE.

struct device_constant_key
: detail::compile_time_property_key<detail::PropKind::DeviceConstant> {
using value_t = property_value<device_constant_key>;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have this new property and its effects on the device_global class documented anywhere?

[NOTE]
====
If _T_ is `const` then implementations may choose to allocate the `device_global` in a dedicated constant address space as an optimization. When using the {dpcpp} compiler with the CUDA or HIP backend, declaring a `device_global<const T>` is equivalent to declaring a `$$__constant__$$` variable.
====
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem like it corresponds to what is implemented. It could be done in tandem by making all conditional behavior dependent on the new property dependent on the disjunction of that and std::is_const_v<T>. That said, if that is the case we also need to document how that affects the members of the device_global class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finalised the implementation yet (I'm just testing the draft requested changes atm), so I haven't updated the documentation which you are right is completely out of date. I described this here: #16001 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you mention me with a comment like "spec ready for review" when it is ready?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you mention me with a comment like "spec ready for review" when it is ready?

Sure, no problem.

@JackAKirk JackAKirk marked this pull request as draft December 11, 2024 10:54
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support buffer location on CUDA Constant memory optimization for CUDA backend
9 participants