Alias barriers; a replacement for the ICU hack #67
Comments
On 06/03/2021 07.29, Tom Honermann wrote:
ICU defines a |U_ALIASING_BARRIER| <https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/char16ptr.h#L30-L36> macro that is used to allow ICU to use |char16_t| internally while also providing interfaces that work with text stored in |wchar_t| (when it is a 16-bit type) or |uint16_t| (when available) without having to copy the text to and from |char16_t| based storage. This is important for efficient operation on Windows and with other libraries that use UTF-16 internally, but that do not use |char16_t| as their UTF-16 character type.
For most compilers, the |U_ALIASING_BARRIER| macro is a no-op and ICU relies on the compiler not taking advantage of |char16_t| being a distinct non-aliasing type of the other ICU supported UTF-16 character types.
That is a daring approach, and I'm flabbergasted that it appears to work
for "most compilers".
For Clang and gcc, ICU defines the macro as follows and invokes it immediately before using |reinterpret_cast| to convert between pointers to |char16_t| and other supported UTF-16 character types. The (volatile) inline assembly prevents the optimizer from reordering loads and stores across the inline assembly and the "memory" clobber informs the compiler that memory read before the inline assembly must be re-read, thus forming a read/write memory barrier. See the gcc documentation <https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html> for more details.
|#define U_ALIASING_BARRIER(ptr) asm volatile("" : : "rm"(ptr) : "memory") |
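To make the usage pattern described above concrete, here is a minimal sketch of invoking the barrier immediately before the |reinterpret_cast|. The helper names |as_char16| and |utf16_length| are hypothetical and are not ICU functions; the macro body is the one quoted above, and the no-op fallback mirrors the "most compilers" behavior described in the text. This shows the workaround as it is used in practice, not a guarantee that the access is well-defined by the standard.

```cpp
#include <cstddef>
#include <cstdint>

// Macro body as quoted above for gcc and Clang; a no-op elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define U_ALIASING_BARRIER(ptr) asm volatile("" : : "rm"(ptr) : "memory")
#else
#define U_ALIASING_BARRIER(ptr)
#endif

// Hypothetical helper (not an ICU API): view uint16_t-based UTF-16
// storage through a char16_t pointer without copying the text.
inline const char16_t* as_char16(const std::uint16_t* p) {
    U_ALIASING_BARRIER(p);  // barrier goes immediately before the cast
    return reinterpret_cast<const char16_t*>(p);
}

// Hypothetical consumer that works on char16_t internally:
// counts code units up to the terminating zero.
inline std::size_t utf16_length(const char16_t* s) {
    std::size_t n = 0;
    while (s[n] != u'\0') {
        ++n;
    }
    return n;
}
```

A caller holding |uint16_t| text would write |utf16_length(as_char16(buffer))|. The "memory" clobber is what makes this blunt: it affects every object in memory, not only the buffer being converted.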
The introduction of |char8_t| as a non-aliasing type in C++20 creates a similar need for some form of an alias barrier that allows limited interchange between libraries that use |char8_t| for UTF-8 data internally and those that use |char| or |unsigned char| for UTF-8 data internally. Though the same problem applies in principle for |char8_t| with respect to |char16_t|, in practice this is less of a concern because |char| and |unsigned char| are aliasing types.
Converting a pointer to one type to a pointer to another unrelated type requires use of |reinterpret_cast| and that prevents performing such conversions in constant expressions and, likely, introduces UB. An alias barrier could potentially allow such conversions in constant expressions between types that meet certain compatibility requirements; for example, a common underlying type.
The proper approach is not to have an alias barrier and a
reinterpret_cast as independent things, but to have an
underlying_sibling_cast<T*>(x) (or similar) that tells the
compiler, in a more targeted manner, that x and T* may now
alias. The problem is whether/when the scope of such aliasing
ends; if the pointer escapes, we'd poison the entire program.
Jens
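For illustration only, the interface of such a cast might look like the sketch below. Only the name |underlying_sibling_cast| comes from the message above; the compatibility checks (integral types of equal size and alignment) and the body are assumptions. A real facility would need compiler support to convey targeted aliasing information, so this sketch falls back on the same volatile inline assembly barrier on gcc and Clang and does nothing to address the scope question raised above.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical sketch: the name is taken from the suggestion above, the
// constraints and the implementation are assumptions, not a proposal.
template <class To, class From>
To underlying_sibling_cast(From* p) {
    static_assert(std::is_pointer_v<To>, "To must be a pointer type");
    using T = std::remove_cv_t<std::remove_pointer_t<To>>;
    using F = std::remove_cv_t<From>;
    // Assumed notion of "sibling" types: integral, same size and alignment.
    static_assert(std::is_integral_v<T> && std::is_integral_v<F>,
                  "both pointee types must be integral");
    static_assert(sizeof(T) == sizeof(F) && alignof(T) == alignof(F),
                  "pointee types must share size and alignment");
#if defined(__GNUC__) || defined(__clang__)
    // Stand-in for real compiler support: a full memory barrier, which is
    // far less targeted than the cast described above is meant to be.
    asm volatile("" : : "rm"(p) : "memory");
#endif
    // reinterpret_cast still rejects casts that would drop const.
    return reinterpret_cast<To>(p);
}
```

Usage would read |char16_t* s = underlying_sibling_cast<char16_t*>(buf);| for a |uint16_t| buffer |buf|. Folding the barrier and the cast into one operation at least gives a compiler a single place to attach narrower semantics later.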
It is, and I suspect it does not actually suffice for "most compilers". My "most compilers" statement was derived from the fact that ICU defines the ICU's platform support is documented here. Markus Scherer has reported discussing alias concerns with Microsoft engineers where he was assured that Visual C++ will never treat Outside of Windows, gcc and Clang probably cover most of the real world use of ICU these days. I suspect that, even where ICU is using the |
With Richard's example code,
only clang optimizes |
In off-list discussion, Richard Smith noted that P0593R6 discusses a |
This issue was discussed in the context of P2626R0 ( No polls were taken, but it is clear that we need to get a better understanding of core language limitations to make further progress on this issue. |
So looking again at the GCC code, I see char8_t was handled here: But when char16_t was added: Was not done the same. It is conservatively correct. Let me file a bug. |
ICU defines a |U_ALIASING_BARRIER| macro that is used to allow ICU to use |char16_t| internally while also providing interfaces that work with text stored in |wchar_t| (when it is a 16-bit type) or |uint16_t| (when available) without having to copy the text to and from |char16_t| based storage. This is important for efficient operation on Windows and with other libraries that use UTF-16 internally, but that do not use |char16_t| as their UTF-16 character type.
For most compilers, the |U_ALIASING_BARRIER| macro is a no-op and ICU relies on the compiler not taking advantage of |char16_t| being a distinct non-aliasing type of the other ICU supported UTF-16 character types.
For Clang and gcc, ICU defines the macro as follows and invokes it immediately before using |reinterpret_cast| to convert between pointers to |char16_t| and other supported UTF-16 character types. The (volatile) inline assembly prevents the optimizer from reordering loads and stores across the inline assembly and the "memory" clobber informs the compiler that memory read before the inline assembly must be re-read, thus forming a read/write memory barrier. See the gcc documentation for more details.
|#define U_ALIASING_BARRIER(ptr) asm volatile("" : : "rm"(ptr) : "memory") |
The introduction of |char8_t| as a non-aliasing type in C++20 creates a similar need for some form of an alias barrier that allows limited interchange between libraries that use |char8_t| for UTF-8 data internally and those that use |char| or |unsigned char| for UTF-8 data internally. Though the same problem applies in principle for |char8_t| with respect to |char16_t|, in practice this is less of a concern because |char| and |unsigned char| are aliasing types.
Converting a pointer to one type to a pointer to another unrelated type requires use of |reinterpret_cast| and that prevents performing such conversions in constant expressions and, likely, introduces UB. An alias barrier could potentially allow such conversions in constant expressions between types that meet certain compatibility requirements; for example, a common underlying type.
Recent exploration of this area has uncovered some simple test cases that demonstrate that an alias barrier is needed in practice for some compilers. The following links contain code that does not perform as intended. In each case, three attempts are made to "fix" the example using various approaches. Of the approaches tried, ICU's volatile inline assembly trick is the only one that works in all cases. In each case, the intended behavior is that the program output "Hi" when run.
Though ICU's inline assembly trick does seem to work for all of these cases, it has the downside of pessimizing optimizers more than is necessary or desired. A more targeted solution is therefore desired.