
<!doctype html public "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

<head>

<title>char8_t: A type for UTF-8 characters and strings (Revision 1)</title>

<style type="text/css">
pre {
    display: inline;
}

table#header th,
table#header td
{
    text-align: left;
}
table#references th,
table#references td
{
    vertical-align: top;
}

ins, ins * { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
del, del * { text-decoration:line-through; background-color:#FFA0A0 }
#hidedel:checked ~ * del, #hidedel:checked ~ * del * { display:none; visibility:hidden }

blockquote
{
    color: #000000;
    background-color: #F1F1F1;
    border: 1px solid #D1D1D1;
    padding-left: 0.5em;
    padding-right: 0.5em;
}
blockquote.stdins
{
    text-decoration: underline;
    color: #000000;
    background-color: #C8FFC8;
    border: 1px solid #B3EBB3;
    padding: 0.5em;
}
blockquote.stddel
{
    text-decoration: line-through;
    color: #000000;
    background-color: #FFEBFF;
    border: 1px solid #ECD7EC;
    padding-left: 0.5em;
    padding-right: 0.5em;
}

blockquote.quote
{
    margin-top: 0em;
    margin-left: 0em;
    border-style: solid;
    background-color: lemonchiffon;
    color: #000000;
    border: 1px solid black;
}

div.compare {
  padding-left: 40px;
  display: table; /* undo float:left effect */
}
div.compare_item {
  float: left;
  margin: 2px;
}

</style>

</head>


<body>

<table id="header">
  <tr>
    <th>Proposal for C2x</th>
  </tr>
  <tr>
    <th>WG14 N2653</th>
  </tr>
  <tr>
    <th/>
  </tr>
  <tr>
    <th>Title:</th>
    <td>char8_t: A type for UTF-8 characters and strings (Revision 1)</td>
  </tr>
  <tr>
    <th>Revises:</th>
    <td><a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm">N2231</a></td>
  </tr>
  <tr>
    <th>Author:</th>
    <td>Tom Honermann &lt;tom@honermann.net&gt;</td>
  </tr>
  <tr>
    <th>Date:</th>
    <td>2021-06-04</td>
  </tr>
  <tr>
    <th>Proposal category:</th>
    <td>New features, change to existing features</td>
  </tr>
  <tr>
    <th>Target audience:</th>
    <td>Developers working on combined C and C++ code bases</td>
  </tr>
</table>


<p>
<strong>Abstract:</strong> C++20, through the adoption of
<a title="[WG21 P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
   href="https://wg21.link/p0482r6">
WG21 P0482R6</a>
<sup><a title="[WG21 P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
        href="#ref_wg21_p0482r6">
[WG21 P0482R6]</a></sup>,
added a new <tt>char8_t</tt> fundamental type, changed the character
type of <tt>u8</tt> character and string literals from <tt>char</tt> to
<tt>char8_t</tt>, and added the <tt>c8rtomb()</tt> and <tt>mbrtoc8()</tt>
functions for conversion between multibyte characters and UTF-8.
This paper proposes corresponding changes for C to add a <tt>char8_t</tt>
typedef name with type <tt>unsigned char</tt>, to change the array element
type of <tt>u8</tt> string literals from <tt>char</tt> to <tt>unsigned char</tt>
(<tt>u8</tt> character literals already have type <tt>unsigned char</tt>),
and to add the <tt>c8rtomb()</tt> and <tt>mbrtoc8()</tt> functions.
These changes are intended to maintain compatibility between C and C++ and
to improve portable support for UTF-8.
</p>


<ul>
  <li><a href="#changes_since_n2231">
      Changes since N2231</a></li>
  <li><a href="#introduction">
      Introduction</a></li>
  <li><a href="#motivation">
      Motivation</a></li>
  <li><a href="#design_options">
      Design Options</a>
    <ul>
      <li><a href="#do_char8_t_type">
          The <tt>char8_t</tt> type: typedef name vs a new integer type</a></li>
      <li><a href="#do_char8_t_underlying_type">
          The underlying type of <tt>char8_t</tt></a></li>
      <li><a href="#do_u8_string_lit_type">
          UTF-8 string literal type</a></li>
      <li><a href="#do_char_array_init">
          <tt>char</tt> array initialization by a UTF-8 string literal</a></li>
    </ul>
  </li>
  <li><a href="#proposal">
      Proposal</a></li>
  <li><a href="#backward_compat">
      Backward Compatibility</a>
    <ul>
      <li><a href="#bc_pointer_conversion">
          Pointer conversion from a UTF-8 string literal</a></li>
      <li><a href="#bc_string_lit_element_value">
          The value of a UTF-8 string literal element</a></li>
      <li><a href="#bc_type_inference">
          Type inference</a></li>
    </ul>
  </li>
  <li><a href="#implementation_exp">
      Implementation Experience</a></li>
  <li><a href="#wording">
      Formal Wording</a></li>
  <li><a href="#acknowledgements">
      Acknowledgements</a></li>
  <li><a href="#references">
      References</a></li>
</ul>


<h1 id="changes_since_n2231">Changes since <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm">N2231</a></h1>

<ul>
  <li>Proposal changes:
    <ul>
      <li>Rebased the proposed wording on
          <a title="[WG14 N2596]: C2x Working Draft"
             href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2596.pdf">
          WG14 N2596</a>
          <sup><a title="[WG14 N2596]: C2x Working Draft"
                  href="#ref_wg14_n2596">
          [WG14 N2596]</a></sup></li>
      <li>Updated wording to address <tt>u8</tt> character literals and removed
          references to
          <a title="[WG14 N2198]: Adding the u8 character prefix"
             href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2198.pdf">
          WG14 N2198</a>
          since it has been incorporated in the working draft.</li>
      <li>Removed drafting notes regarding
          <a title="[WG14 DR 488]: c16rtomb() on wide characters encoded as multiple char16_t"
             href="http://www.open-std.org/jtc1/sc22/WG14/www/docs/summary.htm#dr_488">
          WG14 DR 488</a>
          since its resolution has now been incorporated in the working draft.</li>
      <li>Removed the previously proposed change to disallow initialization of
          an array of type <tt>char</tt> or <tt>signed char</tt> by a UTF-8
          string literal.</li>
      <li>Removed the previously proposed <tt>__STDC_UTF_8__</tt> macro since UTF-8
          character and string literals and the <tt>char8_t</tt> type are intended
          only for use with UTF-8.</li>
    </ul>
  </li>
  <li>Other changes:
    <ul>
      <li>Rewrote the abstract to reflect that
          <a title="[WG21 P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
             href="https://wg21.link/p0482r6">
          WG21 P0482R6</a>
          <sup><a title="[WG21 P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
                  href="#ref_wg21_p0482r6">
          [WG21 P0482R6]</a></sup>
          was adopted for C++20.</li>
      <li>Rewrote the <a href="#motivation">Motivation</a> section.</li>
      <li>Added the <a href="#design_options">Design Options</a> section.</li>
      <li>Expanded the <a href="#backward_compat">Backward Compatibility</a> section.</li>
      <li>Updated the <a href="#implementation_exp">Implementation Experience</a>
          section with links to completed implementations in gcc and glibc.</li>
      <li>Removed use of highlight.js for code highlighting purposes.</li>
    </ul>
  </li>
</ul>


<h1 id="introduction">Introduction</h1>

<p>C11 introduced support for UTF-8, 16-bit, and 32-bit encoded string
literals.
New <tt>char16_t</tt> and <tt>char32_t</tt> typedef names were added to hold
values of code units for the 16-bit and 32-bit variants, but a new type
or typedef name was not added for the UTF-8 variant.
Instead, UTF-8 string literals were specified with the same type used for
ordinary string literals: array of <tt>char</tt>.
UTF-8 is the only character encoding mandated to be supported by the C
standard for which the standard does not provide a distinctly named code
unit type.
</p>

<p>Whether <tt>char</tt> is a signed or unsigned type is
implementation-defined.
Implementations that use an 8-bit signed <tt>char</tt> are at a disadvantage
with respect to working with UTF-8 encoded text since the value range of their
implementation of <tt>char</tt> does not extend to the full range of UTF-8 code
unit values; programmers working with such implementations must inject casts
to <tt>unsigned char</tt> for portable code to correctly process lead and
continuation code unit values.
</p>

<p>The lack of a distinct type and the use of a code unit type with a range
that does not portably include the full unsigned range of UTF-8 code units
presents challenges for working with UTF-8 encoded text that are not present
when working with UTF-16 or UTF-32 encoded text.
Enclosed is a proposal for a new <tt>char8_t</tt> typedef and related language
and library enhancements intended to better facilitate portable handling of
UTF-8 encoded text and to enable working with all five of the standard
mandated character encodings in a consistent manner.
</p>


<h1 id="motivation">Motivation</h1>

<p>
As of February 2021,
<a title="Usage of UTF-8 for websites"
   href="https://w3techs.com/technologies/details/en-utf8/all/all">
UTF-8 is now used by more than 96% of all websites</a>
<sup><a title="Usage of UTF-8 for websites"
        href="#ref_w3techs">
[W3Techs]</a></sup>.
While UTF-8 now dominates websites, it has not attained similar adoption
rates in the execution environments of C and C++ programs.
Microsoft has introduced several ways in which a program can opt in to use of
UTF-8 as the Active Code Page (ACP) starting with the April 2018 update of
Windows 10, but, by default, the ACP remains dependent on region settings.
Most POSIX systems, including Linux and macOS, use UTF-8 as the system
encoding by default, but continue to support changing the execution environment
encoding via locale related environment variables like <tt>LC_ALL</tt>.
Systems built on EBCDIC, like IBM's z/OS, continue to remain significant players
in the C and C++ ecosystems.
</p>

<p>
Programs that consume or produce UTF-8 text and text for which the encoding is
dependent on the execution environment must choose one of a few approaches to
manage text represented in these potentially distinct encodings:
<ul>
  <li>Use <tt>char</tt> for all text and meticulously track which encoding
      is to be used at all times.</li>
  <li>Use <tt>char</tt> for all text, but meticulously convert to or from UTF-8
      when interacting with the environment so that text is always represented
      as UTF-8 within a component.</li>
  <li>Use <tt>char</tt> when working with text for the execution environment,
      and a different type, generally <tt>unsigned char</tt>, for UTF-8
      encoded text.</li>
</ul>
</p>

<p>
The challenge with the first two approaches is ensuring that text is
appropriately tagged and converted as it flows through the program.
Since the same type, <tt>char</tt>, is used as the code unit type for all
text, the programmer is unable to rely on the type system to help identify
when text has not been appropriately converted.
</p>

<p>
The challenge with the third approach is the lack of a common type that
unambiguously denotes UTF-8 text across components.
Within a program, even if there is agreement on an alternate type to use,
UTF-8 string literals still have type array of <tt>char</tt>, not the
agreed upon type.
</p>

<p>
The adoption of a <tt>char8_t</tt> type via
<a title="[P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
   href="https://wg21.link/p0482r6">
P0482R6</a>
<sup><a title="[P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
        href="#ref_wg21_p0482r6">
[P0482R6]</a></sup>
for C++20 provided a common type tailored for use with UTF-8 text.
Adoption of a similar type for C would facilitate source code compatibility
between C and C++20, establish a standard common type for programmers that
prefer the third approach above, and provide consistent behavior across
implementations without the difficulties imposed by the implementation-defined
signedness of <tt>char</tt>.
</p>

<p>
Consider the following function that purports to check whether a pointer
points to UTF-8 text that begins with a UTF-8 leading byte.
UTF-8 leading bytes have values in the range 192 (0xC0) to 255 (0xFF) though
not all values in that range may appear in valid UTF-8 encoded text.
<blockquote><pre>
bool starts_with_utf8_leading_byte(const char *s) {
  return *s &gt;= 0xC0;
}
</pre></blockquote>
For implementations that define <tt>char</tt> as either an unsigned type or
with a size greater than 8 bits, this function will correctly classify its
inputs (assuming no invalid values).
However, for implementations that define <tt>char</tt> as a signed 8-bit type
with a two's complement representation and a range of -128 (-0x80) to
127 (0x7F), the values of UTF-8 leading bytes become negative values with
the result that this function always returns false.
For the function to behave consistently across implementations, it must be
modified to ensure the comparison is performed with an unsigned type.
<blockquote><pre>
bool starts_with_utf8_leading_byte(const char *s) {
  return (unsigned char)*s &gt;= 0xC0;
}
</pre></blockquote>
The introduction of a <tt>char8_t</tt> type that behaves as an unsigned type
would allow the function to be simply implemented as follows such that it
behaves the same for all C and C++20 implementations.
<blockquote><pre>
bool starts_with_utf8_leading_byte(const char8_t *s) {
  return *s &gt;= 0xC0;
}
</pre></blockquote>
</p>

<p>
Functions like the <tt>starts_with_utf8_leading_byte()</tt> example above
are not frequently written and the problem exhibited can be easily
discovered and corrected during testing.
However, more insidious problems may be encountered in other cases, such as
with the <tt>&lt;ctype.h&gt;</tt> character classification functions.
Consider the following code that naively attempts to convert its input to
uppercase using <tt>toupper()</tt>.
<blockquote><pre>
void convert_to_uppercase(char *p) {
  for (; *p; ++p) {
    *p = toupper(*p);
  }
}
</pre></blockquote>
</p>

<p>
When called with a UTF-8 encoded string that contains non-ASCII characters,
this function encounters undefined behavior for implementations with an 8-bit
signed <tt>char</tt> type; even when the current locale is UTF-8-based.
The problem is that lead and continuation UTF-8 code unit values are negative
for such implementations and may result in a sign extended negative value
(that does not match <tt>EOF</tt>) being passed to <tt>toupper()</tt>.
The result is undefined behavior according to
C17 7.4, "Character handling &lt;ctype.h&gt;", paragraph 1:
<div style="margin-left: 1em;">
<blockquote class="quote">
The header <tt>&lt;ctype.h&gt;</tt> declares several functions useful for
classifying and mapping characters.<sup>202)</sup>
In all cases the argument is an <tt>int</tt>, the value of which shall be
representable as an <tt>unsigned char</tt> or shall equal the value of the
macro <tt>EOF</tt>.
If the argument has any other value, the behavior is undefined.
</blockquote>
</div>
For this code to portably work as intended, the argument to <tt>toupper()</tt>
must be cast to <tt>unsigned char</tt>.
Alternatively, changing the type of the <tt>convert_to_uppercase()</tt>
parameter to the proposed <tt>char8_t</tt> type would portably correct the
code while also signifying that the intended input is UTF-8.
</p>
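<p>
As a minimal sketch of the first correction described above, the cast can be
applied at the call to <tt>toupper()</tt>.
The explicit conversion of the result back to <tt>char</tt> is an addition made
for illustration.
As with the original, only characters that the current locale classifies as
lowercase are mapped, so in the "C" locale non-ASCII code units pass through
unchanged.
</p>

```c
#include <ctype.h>

/* Uppercases a NUL-terminated string in place.
   The cast keeps every argument passed to toupper() within the range of
   unsigned char, as required by C17 7.4 paragraph 1. */
void convert_to_uppercase(char *p) {
  for (; *p; ++p) {
    *p = (char)toupper((unsigned char)*p);
  }
}
```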


<h1 id="design_options">Design Options</h1>


<h2 id="do_char8_t_type">The <tt>char8_t</tt> type: typedef name vs a new integer type</h2>

<p>
When the <tt>char16_t</tt> and <tt>char32_t</tt> types were introduced in C11
and C++11, a choice was faced whether to introduce them as typedef names of
existing types or as new integer types.
The WG14 and WG21 committees chose different directions; WG14 opted for
typedef names for C and WG21 opted for new integer types for C++.
This choice was consistent with prior choices regarding the <tt>wchar_t</tt>
type.
The same choice applies for the introduction of a <tt>char8_t</tt> type.
</p>

<p>
The <tt>char16_t</tt> and <tt>char32_t</tt> types were added to C++11 by the
adoption of
<a title="[WG21 N2249]: New Character Types in C++"
   href="https://wg21.link/n2249">
WG21 N2249</a>
<sup><a title="[WG21 N2249]: New Character Types in C++"
        href="#ref_wg21_n2249">
[WG21 N2249]</a></sup>.
The motivation for new integer types stated in that proposal includes the
ability to support function overloading and template specialization; abilities
that would not be possible, at least not reliably and portably, if the new
types were simply typedef names of existing types.
At the time these types were adopted, C did not yet have support for generic
programming; the <tt>_Generic</tt> generic selection expression had not yet
been adopted.
Thus, there was little to no motivation for WG14 to impose the additional effort
required to support new integer types on implementors.
</p>

<p>
WG14 now has several proposals to improve support for generic programming in C:
<ul>
  <li><a title="WG14 N2734: Improve type generic programming"
         href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2734.pdf">
      WG14 N2734: Improve type generic programming</a>
      <sup><a title="[WG14 N2734]: Improve type generic programming"
              href="#ref_wg14_n2734">
      [WG14 N2734]</a></sup></li>
  <li><a title="WG14 N2724: Not-So-Magic - typeof(...) in C | r3"
         href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2724.htm">
      WG14 N2724: Not-So-Magic - typeof(...) in C | r3</a>
      <sup><a title="[WG14 N2724]: Not-So-Magic - typeof(...) in C | r3"
              href="#ref_wg14_n2724">
      [WG14 N2724]</a></sup></li>
  <li><a title="WG14 N2738: Type-generic lambdas"
         href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2738.pdf">
      WG14 N2738: Type-generic lambdas</a>
      <sup><a title="[WG14 N2738]: Type-generic lambdas"
              href="#ref_wg14_n2738">
      [WG14 N2738]</a></sup></li>
  <li><a title="WG14 N2735: Type inference for variable definitions and function returns"
         href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2735.pdf">
      WG14 N2735: Type inference for variable definitions and function returns</a>
      <sup><a title="[WG14 N2735]: Type inference for variable definitions and function returns"
              href="#ref_wg14_n2735">
      [WG14 N2735]</a></sup></li>
</ul>
Desire for generic programming improvements may translate to additional
motivation for distinct integer types for character data.
The following example illustrates a potential use case that would be enabled
by distinct types.
<blockquote><pre>
void send_narrow(const char*);
void send_wide(const wchar_t*);
void send_utf8(const char8_t*);
void send_utf16(const char16_t*);
void send_utf32(const char32_t*);
#define send(X)                            \
        _Generic((X),                      \
                 char*:     send_narrow,   \
                 wchar_t*:  send_wide,     \
                 char8_t*:  send_utf8,     \
                 char16_t*: send_utf16,    \
                 char32_t*: send_utf32)(X)
void f() {
  send(L"text");   /* Would be ok with distinct types; calls send_wide(). */
  send(u8"text");  /* Would be ok with distinct types; calls send_utf8(). */
}
</pre></blockquote>
</p>

<p>
Clang supports an
<a title="Clang 11 documentation, Attributes in Clang"
   href="https://releases.llvm.org/11.0.0/tools/clang/docs/AttributeReference.html#overloadable">
extension that enables overloading in C</a>
<sup><a title="Clang 11 documentation, Attributes in Clang"
        href="#ref_clang_overloadable">
[Clang overloadable]</a></sup>.
If adopted by WG14, the code above could be more simply written as:
<blockquote><pre>
void __attribute__((overloadable)) send(const char*);
void __attribute__((overloadable)) send(const wchar_t*);
void __attribute__((overloadable)) send(const char8_t*);
void __attribute__((overloadable)) send(const char16_t*);
void __attribute__((overloadable)) send(const char32_t*);
void f() {
  send(L"text");   /* Would be ok with distinct types; calls send(const wchar_t*). */
  send(u8"text");  /* Would be ok with distinct types; calls send(const char8_t*). */
}
</pre></blockquote>
</p>

<p>
Additional motivation for distinct integer types is the ability to specify
them as non-aliasing types.
A non-aliasing type is one for which objects of the type may only be accessed
using a limited set of types; compatible types and specially designated types
like <tt>char</tt> and <tt>unsigned char</tt>.
Compilers may use type based alias analysis (TBAA) to generate more efficient
code for non-aliasing types.
Aliasing violations result in undefined behavior.
</p>

<p>
The following example code would be well-formed in C regardless of whether
<tt>char8_t</tt> is specified as a new integer type or as a typedef name of
an existing character type.
If <tt>char8_t</tt> is specified as a typedef name of an existing character
type, then the example also works as expected because it does not violate
aliasing rules.
However, if <tt>char8_t</tt> is specified as a new integer type, then the
example would exhibit undefined behavior because an object of type
<tt>char</tt> is accessed using the <tt>char8_t</tt> type (assuming no new
special provisions added to C17 6.5, Expressions, paragraph 7).
Thus, there is a trade-off between code efficiency and safety inherent in
how <tt>char8_t</tt> is defined.
<blockquote><pre>
void do_utf8_things(const char8_t *s) {
  *s;
}
void f() {
  const char *presumably_utf8_text = "text";
  do_utf8_things(presumably_utf8_text);
}
</pre></blockquote>
</p>

<p>
Since <tt>char8_t</tt> is a distinct type in C++ and the C++ type system
prohibits implicit access to objects with an incompatible type without use of
a cast, the above example is ill-formed in C++20.
However, the code may be rendered well-formed in C++20 by the addition of a
cast, but will then result in undefined behavior when executed.
<blockquote><pre>
void do_utf8_things(const char8_t *s) {
  *s;
}
void f() {
  const char *presumably_utf8_text = "text";
  do_utf8_things((const char8_t*)presumably_utf8_text);
}
</pre></blockquote>
Such a cast might be added by a C programmer in order to silence warnings
regarding a change of signedness that might be produced when the
<tt>const char*</tt> argument to <tt>do_utf8_things()</tt> is converted to
<tt>const char8_t*</tt>; assuming <tt>char8_t</tt> is a typedef name of a
differently signed character type (otherwise, if <tt>char8_t</tt> were a
distinct type, the code would exhibit undefined behavior whether or not
the cast was present).
In that case, the unfortunate result is that the code is well-formed for
both C and C++, but exhibits undefined behavior only when compiled for C++.
</p>

<p>
This aliasing asymmetry between C and C++ is not a new concern; it already
exists for the <tt>wchar_t</tt>, <tt>char16_t</tt>, and <tt>char32_t</tt>
types.
For example, <tt>char16_t</tt> and <tt>uint_least16_t</tt> are distinct
integer types in C++ (and do not alias), but are the same type in C.
Whether these aliasing issues are more significant for <tt>char8_t</tt> as
opposed to the other character types is a subjective concern.
</p>
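<p>
That <tt>char16_t</tt> and <tt>uint_least16_t</tt> name a single type in C can
be observed with a generic selection.
The following is only a sketch; the function name
<tt>char16_is_uint_least16()</tt> is hypothetical and not part of this proposal.
</p>

```c
#include <stdint.h>
#include <uchar.h>

/* Returns 1 when char16_t and uint_least16_t are the same type, as C11 7.28
   specifies. The default association would be selected only if char16_t
   were a distinct type. */
int char16_is_uint_least16(void) {
  return _Generic((char16_t)0, uint_least16_t: 1, default: 0);
}
```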

<p>
Introduction of a new <tt>char8_t</tt> integer type without a corresponding
change to make <tt>wchar_t</tt>, <tt>char16_t</tt>, and <tt>char32_t</tt>
distinct integer types would be inconsistent and surprising.
While the author sees potential use for distinct types as shown above, such
a change of direction should be pursued via a separate proposal.
Should WG14 indicate support for such direction when reviewing this proposal,
the author will submit a separate proposal.
In the meantime, this proposal advocates for only a new <tt>char8_t</tt>
typedef name in order to maintain consistency with the existing character
types.
</p>

<p>
<b>Proposed: a new <tt>char8_t</tt> typedef name defined in the
<tt>&lt;uchar.h&gt;</tt> header.</b>
</p>


<h2 id="do_char8_t_underlying_type">The underlying type of <tt>char8_t</tt></h2>

<p>
UTF-8 code unit values range from 0x00 to 0xF4 (the values 0xC0, 0xC1, and 0xF5
through 0xFF do not occur in well-formed UTF-8 code unit sequences) and
therefore require at least an 8-bit type for storage.
</p>

<p>
The existing <tt>char16_t</tt> and <tt>char32_t</tt> typedef names are defined
as having the same type as <tt>uint_least16_t</tt> and <tt>uint_least32_t</tt>
respectively.
This suggests that the underlying type of <tt>char8_t</tt> should be the same
type as <tt>uint_least8_t</tt>.
However, the latitude provided for the <tt>uint_least8_t</tt> typedef name to
be defined with a type other than <tt>unsigned char</tt> provides no benefit
for the proposed <tt>char8_t</tt> type; <tt>unsigned char</tt> is already
defined to be unsigned with a size and alignment of 1 byte.
Since bytes are constrained to be at least 8 bits and no smaller types are
possible, additional leniency would only serve to limit portability.
</p>
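<p>
The properties of <tt>unsigned char</tt> cited above can be checked at compile
time.
In this sketch, <tt>utf8_unit</tt> is a hypothetical stand-in for the proposed
<tt>char8_t</tt> typedef name, used so that the example compiles with
implementations that do not yet provide <tt>char8_t</tt>.
</p>

```c
#include <limits.h>

/* Hypothetical stand-in for the proposed char8_t typedef name. */
typedef unsigned char utf8_unit;

/* Each assertion restates a guarantee the standard already makes. */
_Static_assert(sizeof(utf8_unit) == 1, "unsigned char occupies exactly one byte");
_Static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
_Static_assert(UCHAR_MAX >= 0xFF, "every UTF-8 code unit value is representable");
_Static_assert((utf8_unit)-1 > 0, "the type is unsigned");
```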

<p>
The type of character constants with a <tt>u8</tt> <i>encoding-prefix</i> is
already <tt>unsigned char</tt>.
The underlying type for <tt>char8_t</tt> in C++20 is also
<tt>unsigned char</tt>.
For consistency with <tt>u8</tt> character constants and the C++20
<tt>char8_t</tt> type, this proposal defines the underlying type of the
proposed <tt>char8_t</tt> type to be <tt>unsigned char</tt>.
</p>

<p>
<b>Proposed: The underlying type of <tt>char8_t</tt> is
<tt>unsigned char</tt>.</b>
</p>


<h2 id="do_u8_string_lit_type">UTF-8 string literal type</h2>

<p>
In C17, a UTF-8 string literal has type array of <tt>char</tt>.
Since the size and signedness of <tt>char</tt> are implementation-defined,
portable code requires casts to an unsigned type when reading UTF-8
code unit values stored in objects of type <tt>char</tt>.
This is required because common implementations implement <tt>char</tt>
as a signed 8-bit type for which integer promotion rules produce a
negative value for leading and trailing code unit values (which all have
values above 0x7F).
While it is uncommon for code to directly access the elements of a string
literal, such accesses may occur when macros are involved.
</p>

<p>
In the working draft, UTF-8 character constants have a type of
<tt>unsigned char</tt>.
That results in a surprising inconsistency with UTF-8 string literals.
<blockquote><pre>
#define M(X) ((X) &gt;= 0x80)

void f() {
  M(u8"\u00E9"[0]); /* True for some implementations, false for others.
                       U+00E9 is encoded as 0xC3 0xA9 in UTF-8.
                       0xC3 will promote to a negative integer value for
                       implementations with a signed 8-bit char type. */
  M(u8'\xC3');      /* True for all implementations. */
}
</pre></blockquote>
Changing the type of a UTF-8 string literal to an array of type <tt>char8_t</tt>
would avoid this inconsistency such that both expressions above would result in
a true value for all implementations.
</p>

<p>
For consistency with <tt>u8</tt> character constants and the type of C++20
UTF-8 string literals, this proposal changes the type of a UTF-8 string literal
from array of <tt>char</tt> to array of <tt>char8_t</tt>.
The <a href="#backward_compat">Backward Compatibility</a> section discusses the
impact of this change.
</p>

<p>
<b>Proposed: The type of UTF-8 string literals is changed from array of
<tt>char</tt> to array of <tt>char8_t</tt>.</b>
</p>


<h2 id="do_char_array_init"><tt>char</tt> array initialization by a UTF-8 string literal</h2>

<p>
In C17, arrays of type <tt>char</tt>, <tt>signed char</tt>, and
<tt>unsigned char</tt> may be initialized by a UTF-8 string
literal.  These were all made ill-formed in C++20 where only
arrays of <tt>char8_t</tt> may be initialized by a UTF-8 string
literal.
<blockquote><pre>
const          char cu8[]  = u8"text";  /* Ok in C17 and C++17, ill-formed in C++20. */
const signed   char scu8[] = u8"text";  /* Ok in C17 and C++17, ill-formed in C++20. */
const unsigned char ucu8[] = u8"text";  /* Ok in C17 and C++17, ill-formed in C++20. */
</pre></blockquote>
</p>

<p>
For other character types, whether an array of the character type can be
initialized by a string literal with a mismatched encoding prefix depends
on the implementation.
C17 6.7.9, "Initialization", paragraph 15 states:
<div style="margin-left: 1em;">
<blockquote class="quote">
An array with element type compatible with a qualified or unqualified version
of <tt>wchar_t</tt>, <tt>char16_t</tt>, or <tt>char32_t</tt> may be initialized
by a wide string literal with the corresponding encoding prefix (<tt>L</tt>,
<tt>u</tt>, or <tt>U</tt>, respectively), optionally enclosed in braces.
Successive wide characters of the wide string literal (including the
terminating null wide character if there is room or if the array is of unknown
size) initialize the elements of the array.
</blockquote>
</div>
C++ does not allow initialization by a string literal with a mismatched
encoding prefix.
<blockquote><pre>
const wchar_t  wc16[] = u"text";  /* Ok in C17 if wchar_t and char16_t are compatible types, ill-formed in C++20. */
const wchar_t  wc32[] = U"text";  /* Ok in C17 if wchar_t and char32_t are compatible types, ill-formed in C++20. */
const char16_t c16w[] = L"text";  /* Ok in C17 if wchar_t and char16_t are compatible types, ill-formed in C++20. */
const char32_t c32w[] = L"text";  /* Ok in C17 if char32_t and wchar_t are compatible types, ill-formed in C++20. */
</pre></blockquote>
</p>

<p>
Prohibiting initialization of arrays of type <tt>char</tt> and
<tt>signed char</tt> by UTF-8 string literals would improve consistency with
C++20.
However, the existing inconsistencies are fully explainable as a consequence of
the choice to use existing integer types for wide character types in C vs the
choice to introduce new integer types in C++.
If WG14 were to decide to switch to use of distinct integer types for wide
character types (and <tt>char8_t</tt>) in the future, then it would make sense
to align initialization allowances with C++.
Until then, this proposal preserves the existing ability to initialize an
array of plain <tt>char</tt> or an array of <tt>signed char</tt>
with a UTF-8 string literal.
</p>

<p>
<b>Proposed: initialization of an array of type <tt>char</tt> or an array of
type <tt>signed char</tt> by a UTF-8 string literal remains well-formed.</b>
</p>


<h1 id="proposal">Proposal</h1>

<p>The proposed changes include:

<ul>
  <li>A new <tt>char8_t</tt> typedef name with type <tt>unsigned char</tt>
      defined in the <tt>&lt;uchar.h&gt;</tt> header.</li>
  <li>The type of UTF-8 string literals is changed from array of
      <tt>char</tt> to array of <tt>char8_t</tt>.</li>
  <li>The type of UTF-8 character literals is changed from
      <tt>unsigned char</tt> to <tt>char8_t</tt>.<br/>
      (Since UTF-8 character literals already have type <tt>unsigned char</tt>,
      this is not a semantic change).</li>
  <li>Initialization of an array of type <tt>char</tt> or type
      <tt>signed char</tt> by a UTF-8 string literal remains well-formed.</li>
  <li>New <tt>mbrtoc8()</tt> and <tt>c8rtomb()</tt> functions declared in
      <tt>&lt;uchar.h&gt;</tt> enable conversions between multibyte characters
      and UTF-8.</li>
  <li>A new <tt>ATOMIC_CHAR8_T_LOCK_FREE</tt> macro.</li>
  <li>A new <tt>atomic_char8_t</tt> typedef name.</li>
</ul>
</p>


<h1 id="backward_compat">Backward Compatibility</h1>

<p>The proposed change to the type of UTF-8 string literals impacts backward
compatibility as described in the following sections.
Implementors are encouraged to offer options to disable <tt>char8_t</tt>
support when necessary to preserve compatibility with C17.
</p>


<h2 id="bc_pointer_conversion">Pointer conversion from a UTF-8 string literal</h2>

<p>
Initialization or assignment of <tt>char</tt> pointers (including
parameters) from UTF-8 string literals remains well-formed under this
proposal.
However, some implementations may produce warnings about differences in
signedness depending on whether <tt>char</tt> is a signed or unsigned type.
</p>

<p>
For example:
<blockquote><pre>
const char *p = u8"text"; // Well-formed in C17 and with this proposal, but
                          // implementations may now warn about different
                          // signedness for the pointer target type.
</pre></blockquote>
</p>
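<p>
One way to avoid such a warning, sketched here under the assumption that the
code must build cleanly both before and after the change, is to make the
conversion explicit.
The function name <tt>utf8_literal_units</tt> is hypothetical.
</p>

```c
#include <stddef.h>
#include <string.h>

/* Counts the code units of a UTF-8 string literal through a char pointer.
   The explicit cast is accepted whether the literal has element type
   char (C17) or unsigned char (this proposal). */
size_t utf8_literal_units(void) {
    const char *p = (const char *)u8"text";
    return strlen(p);
}
```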


<h2 id="bc_string_lit_element_value">The value of a UTF-8 string literal element</h2>

<p>
Code that directly accesses the code unit values of UTF-8 string literals
without an intervening cast to an unsigned type may observe different values
under this proposal.
This will occur for implementations with a signed 8-bit <tt>char</tt> type
when accessing a leading or trailing UTF-8 code unit (such code units have a
value in the range <tt>0x80</tt> through <tt>0xFF</tt>).
</p>

<p>
For example:
<blockquote><pre>
if (u8"\u00E9"[0] &lt; 0) {} // Well-formed with implementation-defined behavior
                          // in C17.  Well-formed with portable behavior with
                          // this proposal (the conditional is always false).
</pre></blockquote>
</p>

<p>
The author is unaware of use cases that involve directly probing the values
of UTF-8 string literal elements, but such accesses may occur as a result of
macro processing.
Code intended to be portable will already contain an appropriate cast to an
unsigned type and will therefore be unaffected by this proposal.
Non-portable code that relies on leading and trailing UTF-8 code unit values
having a negative value will require modification.
</p>
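<p>
The kind of cast that keeps such code portable can be sketched as follows.
The helper names are assumptions made for illustration only.
</p>

```c
/* Returns the first code unit of s as a value in the range 0 to 255,
   independent of the signedness of plain char. */
unsigned int first_code_unit(const char *s) {
    return (unsigned char)s[0];
}

/* U+00E9 is encoded in UTF-8 as 0xC3 0xA9, so the first code unit is a
   leading code unit with a value of at least 0x80. */
int lead_unit_is_non_ascii(void) {
    return (unsigned char)u8"\u00E9"[0] >= 0x80;
}
```

<p>
Because the comparison is performed on an <tt>unsigned char</tt> value, the
result is the same in C17 and with this proposal.
</p>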


<h2 id="bc_type_inference">Type inference</h2>

<p>
Code that makes use of <tt>_Generic</tt> expressions, type inference extensions
such as gcc's <tt>__typeof__</tt> type specifier, or Clang's extension for
overloading in C may become ill-formed or behave differently with this proposal.
</p>

<p>
In the following example, <tt>serialize</tt> is a type-generic macro that, based
on the type of its argument, dispatches to either <tt>serialize_text()</tt>,
<tt>serialize_wide_text()</tt>, <tt>serialize_int()</tt>, or
<tt>serialize_double()</tt>.
With this proposal, there is no longer a type match, so the code becomes
ill-formed.
This code can be corrected on the caller side by adding a cast to <tt>char*</tt>
or on the callee side by adding a type match for <tt>unsigned char*</tt>.
The latter approach has the benefit of allowing <tt>serialize</tt> to dispatch
to a <tt>serialize_u8text()</tt> function that specifically handles UTF-8
encoded text.
<blockquote><pre>
void serialize_text(const char*);
void serialize_wide_text(const wchar_t*);
void serialize_int(int);
void serialize_double(double);
#define serialize(X) _Generic((X),                           \
                              char*:    serialize_text,      \
                              wchar_t*: serialize_wide_text, \
                              int:      serialize_int,       \
                              double:   serialize_double)(X)
void f() {
  serialize(u8"text"); // Well-formed in C17, ill-formed with this proposal.
}
</pre></blockquote>
</p>
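<p>
A minimal sketch of the callee-side correction follows.
It assumes a hypothetical <tt>serialize_u8text()</tt> function and, so that
the dispatch can be observed, has each function return an illustrative
<tt>enum kind</tt> value instead of <tt>void</tt>.
</p>

```c
#include <stddef.h>

enum kind { KIND_TEXT, KIND_U8TEXT, KIND_WIDE, KIND_INT, KIND_DOUBLE };

enum kind serialize_text(const char *s)            { (void)s; return KIND_TEXT; }
enum kind serialize_u8text(const unsigned char *s) { (void)s; return KIND_U8TEXT; }
enum kind serialize_wide_text(const wchar_t *s)    { (void)s; return KIND_WIDE; }
enum kind serialize_int(int i)                     { (void)i; return KIND_INT; }
enum kind serialize_double(double d)               { (void)d; return KIND_DOUBLE; }

/* The added unsigned char* association keeps the macro usable whether a
   UTF-8 string literal has element type char or unsigned char. */
#define serialize(X) _Generic((X),                              \
                              char*:          serialize_text,      \
                              unsigned char*: serialize_u8text,    \
                              wchar_t*:       serialize_wide_text, \
                              int:            serialize_int,       \
                              double:         serialize_double)(X)

/* KIND_TEXT in C17; KIND_U8TEXT with this proposal. */
enum kind classify_u8_literal(void) { return serialize(u8"text"); }

/* Always KIND_TEXT. */
enum kind classify_plain_literal(void) { return serialize("text"); }

/* Always KIND_U8TEXT: the array decays to unsigned char*. */
enum kind classify_u8_buffer(void) {
    unsigned char buf[] = { 0xC3, 0xA9, 0 };
    return serialize(buf);
}
```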

<p>
The following example reimplements the serialization example, using Clang's
extension for overloading in C.
In this case, the change of type for the UTF-8 string literal results in
ambiguous overload resolution.
Here again, the code can be corrected on the caller side by adding a cast, or
can be corrected on the callee side by adding an overload for
<tt>const unsigned char*</tt>.
Again, the latter has the benefit of enabling UTF-8 encoded text to be handled
differently than text matching the execution character set.
<blockquote><pre>
void serialize(const char*)    __attribute__((overloadable));
void serialize(const wchar_t*) __attribute__((overloadable));
void serialize(int)            __attribute__((overloadable));
void serialize(double)         __attribute__((overloadable));
void f() {
  serialize(u8"text"); // Well-formed in C17 with Clang's overloading extension.
                       // Ill-formed with this proposal.
}
</pre></blockquote>
</p>


<h1 id="implementation_exp">Implementation Experience</h1>

<p>The proposed changes have been implemented in forks of gcc and glibc and
are available in the <tt>char8_t-for-c</tt> and <tt>char8_t</tt> branches
respectively of the following repositories:
<ul>
  <li>gcc: <a href="https://github.com/tahonermann/gcc/tree/char8_t-for-c">
      https://github.com/tahonermann/gcc/tree/char8_t-for-c</a></li>
  <li>glibc: <a href="https://github.com/tahonermann/glibc/tree/char8_t">
      https://github.com/tahonermann/glibc/tree/char8_t</a></li>
</ul>
</p>

<p>
The changes to glibc provide declarations for the <tt>char8_t</tt> typedef
name and the <tt>c8rtomb()</tt> and <tt>mbrtoc8()</tt> functions.
When compiling for C, these declarations are only present when the
<tt>_CHAR8_T_SOURCE</tt> feature test macro is defined.
</p>

<p>
The changes to gcc provide the <tt>atomic_char8_t</tt> typedef name, the
<tt>ATOMIC_CHAR8_T_LOCK_FREE</tt> macro, and the change of type for UTF-8
literals from array of <tt>char</tt> to array of <tt>unsigned char</tt>.
The existing <tt>-fchar8_t</tt> and <tt>-fno-char8_t</tt> compiler options
are extended to C code to allow opting-in or opting-out of these changes.
When <tt>-fchar8_t</tt> is enabled, the <tt>_CHAR8_T_SOURCE</tt> macro is
defined to inform the C library that the <tt>char8_t</tt> typedef name and
the <tt>c8rtomb()</tt> and <tt>mbrtoc8()</tt> declarations should be provided
by the <tt>&lt;uchar.h&gt;</tt> header.
</p>


<h1 id="wording">Formal Wording</h1>

<input type="checkbox" id="hidedel"><label for="hidedel">Hide deleted text</label>

<p>These changes are relative to
<a title="[WG14 N2596]: C2x Working Draft"
   href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2596.pdf">
WG14 N2596</a>
<sup><a title="[WG14 N2596]: C2x Working Draft"
        href="#ref_wg14_n2596">
[WG14 N2596]</a></sup>.
</p>

<p>Change in 6.4.4 (Character constants) paragraph 9:
<blockquote>
The value of an octal or hexadecimal escape sequence shall be in the range of
representable values for the corresponding type:
<div style="margin-left: 1em;">
<table style="border-collapse: collapse;">
  <tr style="border-bottom: 1px solid black;">
    <td style="border-right: 1px solid black;">Prefix</td>
    <td>Corresponding type</td>
  </tr>
  <tr>
    <td style="border-right: 1px solid black;">none</td>
    <td><tt>unsigned char</tt></td>
  </tr>
  <tr>
    <td style="border-right: 1px solid black;"><tt>u8</tt></td>
    <td><tt><del>unsigned char</del><ins>char8_t</ins></tt></td>
  </tr>
  <tr>
    <td style="border-right: 1px solid black;"><tt>L</tt></td>
    <td>the unsigned type corresponding to <tt>wchar_t</tt></td>
  </tr>
  <tr>
    <td style="border-right: 1px solid black;"><tt>u</tt></td>
    <td><tt>char16_t</tt></td>
  </tr>
  <tr>
    <td style="border-right: 1px solid black;"><tt>U</tt></td>
    <td><tt>char32_t</tt></td>
  </tr>
</table>
</div>
</blockquote>
</p>

<p>Change in 6.4.4 (Character constants) paragraph 12:
<blockquote>
A UTF-8 character constant has type
<tt><del>unsigned char</del><ins>char8_t</ins></tt>.
The value of a UTF-8 character constant is equal to its ISO/IEC 10646 code
point value, provided that the code point value can be encoded as a
single UTF-8 code unit.
</blockquote>
</p>

<p>Change in 6.4.5 (String literals) paragraph 6:
<blockquote>
[&hellip;] For UTF-8 string literals, the array elements have type
<del><tt>char</tt></del><ins><tt>char8_t</tt></ins>, and are initialized with
the characters of the multibyte character sequence, as encoded in UTF-8.
[&hellip;]
</blockquote>
</p>

<p>Change in 7.17.1 (Introduction) paragraph 3:
<blockquote>
The macros defined are the <em>atomic lock-free macros</em>
<blockquote>
ATOMIC_BOOL_LOCK_FREE<br/>
ATOMIC_CHAR_LOCK_FREE<br/>
<ins>ATOMIC_CHAR8_T_LOCK_FREE</ins><br/>
ATOMIC_CHAR16_T_LOCK_FREE<br/>
ATOMIC_CHAR32_T_LOCK_FREE<br/>
ATOMIC_WCHAR_T_LOCK_FREE<br/>
ATOMIC_SHORT_LOCK_FREE<br/>
ATOMIC_INT_LOCK_FREE<br/>
ATOMIC_LONG_LOCK_FREE<br/>
ATOMIC_LLONG_LOCK_FREE<br/>
ATOMIC_POINTER_LOCK_FREE<br/>
</blockquote>
[&hellip;]<br/>
</blockquote>
</p>

<p>Change in 7.17.6 (Atomic integer types) paragraph 1:
<blockquote>
For each line in the following
table,<sup><em>[Footnote: See "future library directions" (7.31.10).]</em></sup>
the atomic type name is declared as a type that has the same representation
and alignment requirements as the corresponding direct
type.<sup><em>[Footnote: The same representation and alignment requirements are
meant to imply interchangeability as arguments to functions, return values
from functions, and members of unions.]</em></sup>
<div style="margin-left: 1em;">
<table style="border-collapse: collapse;">
  <tr style="border-bottom: 1px solid black;">
    <td>Atomic type name</td>
    <td>Direct type</td>
  </tr>
  <tr>
    <td>[&hellip;]</td>
    <td>[&hellip;]</td>
  </tr>
  <tr>
    <td><tt>atomic_ullong</tt></td>
    <td><tt>_Atomic unsigned long long</tt></td>
  </tr>
  <tr>
    <td><tt><ins>atomic_char8_t</ins></tt></td>
    <td><tt><ins>_Atomic char8_t</ins></tt></td>
  </tr>
  <tr>
    <td><tt>atomic_char16_t</tt></td>
    <td><tt>_Atomic char16_t</tt></td>
  </tr>
  <tr>
    <td><tt>atomic_char32_t</tt></td>
    <td><tt>_Atomic char32_t</tt></td>
  </tr>
  <tr>
    <td><tt>atomic_wchar_t</tt></td>
    <td><tt>_Atomic wchar_t</tt></td>
  </tr>
  <tr>
    <td>[&hellip;]</td>
    <td>[&hellip;]</td>
  </tr>
</table>
</div>
</blockquote>
</p>
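<p>
The following sketch illustrates how code might consult the new macro while
still building with implementations that do not yet provide it.
The names <tt>U8_LOCK_FREE</tt> and <tt>atomic_u8_sketch</tt> are assumptions
made for this example; the latter stands in for <tt>atomic_char8_t</tt> on the
basis that <tt>char8_t</tt> is the same type as <tt>unsigned char</tt>.
</p>

```c
#include <stdatomic.h>

/* Fall back to the plain char macro when ATOMIC_CHAR8_T_LOCK_FREE is not
   provided by the implementation. */
#ifdef ATOMIC_CHAR8_T_LOCK_FREE
#  define U8_LOCK_FREE ATOMIC_CHAR8_T_LOCK_FREE
#else
#  define U8_LOCK_FREE ATOMIC_CHAR_LOCK_FREE
#endif

/* Hypothetical stand-in for atomic_char8_t. */
typedef _Atomic unsigned char atomic_u8_sketch;

/* 0: never lock-free, 1: sometimes lock-free, 2: always lock-free. */
int u8_lock_free_level(void) { return U8_LOCK_FREE; }

/* Stores a UTF-8 code unit atomically and reads it back. */
unsigned char store_then_load(unsigned char unit) {
    atomic_u8_sketch shared;
    atomic_init(&shared, 0);
    atomic_store(&shared, unit);
    return atomic_load(&shared);
}
```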

<p>Change in 7.28 (Unicode utilities &lt;uchar.h&gt;) paragraph 2:
<blockquote>
The types declared are <tt>mbstate_t</tt> (described in 7.29.1) and
<tt>size_t</tt> (described in 7.19);
<ins>
<blockquote>
<tt>char8_t</tt>
</blockquote>
which is an unsigned integer type used for UTF-8 characters and is the
same type as <tt>unsigned char</tt>; and</ins>
<blockquote>
<tt>char16_t</tt>
</blockquote>
which is an unsigned integer type used for 16-bit characters and is the same
type as <tt>uint_least16_t</tt> (described in 7.20.1.2); and
<blockquote>
<tt>char32_t</tt>
</blockquote>
which is an unsigned integer type used for 32-bit characters and is the same
type as <tt>uint_least32_t</tt> (described in 7.20.1.2).
</blockquote>
</p>

<p>Insert a new subclause before 7.28.1.1 (The mbrtoc16 function):
<blockquote class="stdins">
7.28.1.1  <strong>The mbrtoc8 function</strong>
</blockquote>
</p>

<p>Add a new paragraph 1:
<blockquote class="stdins">
<strong>Synopsis</strong><br/>
<blockquote>
<div style="margin-left: 1em;">
<tt>#include</tt> &lt;uchar.h&gt;<br/>
<tt>size_t</tt> mbrtoc8(<tt>char8_t</tt> * <tt>restrict</tt> pc8,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>const</tt> <tt>char</tt> * <tt>restrict</tt> s, <tt>size_t</tt> n,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>mbstate_t</tt> * <tt>restrict</tt> ps);
</div>
</blockquote>
</blockquote>
</p>

<p>Add a new paragraph 2:
<blockquote class="stdins">
<strong>Description</strong><br/>
If <tt>s</tt> is a null pointer, the <tt>mbrtoc8</tt> function is equivalent
to the call:
<blockquote>
<div style="margin-left: 2em;">
mbrtoc8(NULL, "", 1, ps)
</div>
</blockquote>
In this case, the values of the parameters <tt>pc8</tt> and <tt>n</tt> are
ignored.
</blockquote>
</p>

<p>Add a new paragraph 3:
<blockquote class="stdins">
If <tt>s</tt> is not a null pointer, the <tt>mbrtoc8</tt> function inspects at
most <tt>n</tt> bytes beginning with the byte pointed to by <tt>s</tt> to
determine the number of bytes needed to complete the next multibyte character
(including any shift sequences). If the function determines that the next
multibyte character is complete and valid, it determines the values of the
corresponding characters and then, if <tt>pc8</tt> is not a null pointer,
stores the value of the first (or only) such character in the object pointed to
by <tt>pc8</tt>. Subsequent calls will store successive characters without
consuming any additional input until all the characters have been stored. If
the corresponding character is the null character, the resulting
state described is the initial conversion state.
</blockquote>
</p>

<p>Add a new paragraph 4:
<blockquote class="stdins">
<strong>Returns</strong><br/>
The <tt>mbrtoc8</tt> function returns the first of the following that applies
(given the current conversion state):
<table>
  <tr>
    <td>0</td>
    <td>if the next <tt>n</tt> or fewer bytes complete the multibyte character
        that corresponds to the null character (which is the value stored).
    </td>
  </tr>
  <tr>
    <td><em>between 1 and <tt>n</tt> inclusive</em></td>
    <td>if the next <tt>n</tt> or fewer bytes complete a valid multibyte
        character (which is the value stored); the value returned is the number
        of bytes that complete the multibyte character.
    </td>
  </tr>
  <tr>
    <td><tt>(size_t)</tt> (-3)</td>
    <td>if the next character resulting from a previous call has been stored
        (no bytes from the input have been consumed by this call).
    </td>
  </tr>
  <tr>
    <td><tt>(size_t)</tt> (-2)</td>
    <td>if the next <tt>n</tt> bytes contribute to an incomplete (but
        potentially valid) multibyte character, and all <tt>n</tt> bytes have
        been processed (no value is
        stored).<sup><em>[Footnote: When <tt>n</tt> has at least the value of
        the <tt>MB_CUR_MAX</tt> macro, this case can only occur if <tt>s</tt>
        points at a sequence of redundant shift sequences (for implementations
        with state-dependent encodings).]</em></sup>
    </td>
  </tr>
  <tr>
    <td><tt>(size_t)</tt> (-1)</td>
    <td>if an encoding error occurs, in which case the next <tt>n</tt> or
        fewer bytes do not contribute to a complete and valid multibyte
        character (no value is stored); the value of the macro <tt>EILSEQ</tt>
        is stored in <tt>errno</tt>, and the conversion state is unspecified.
    </td>
  </tr>
</table>
</blockquote>
</p>
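<p>
To illustrate the return value protocol described above, in particular the
<tt>(size_t)</tt> (-3) case, the following sketch emulates the behavior on top
of the existing <tt>mbrtoc32</tt> function.
It is not the implementation from the glibc fork; <tt>struct u8_state</tt>,
<tt>encode_utf8</tt>, <tt>queue_code_point</tt>, <tt>sketch_mbrtoc8</tt>, and
<tt>to_utf8</tt> are hypothetical names, and the extra state is kept in a
separate structure because <tt>mbstate_t</tt> is opaque.
</p>

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical state for this sketch: the standard conversion state plus
   any UTF-8 code units that still have to be handed out. */
struct u8_state {
    mbstate_t     mb;
    unsigned char pending[3];
    size_t        count; /* valid entries in pending */
    size_t        next;  /* next entry to hand out */
};

/* Encodes cp (assumed to be a valid code point) as UTF-8 and returns the
   number of code units written to out. */
size_t encode_utf8(char32_t cp, unsigned char out[4]) {
    static const unsigned char lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    size_t len = cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    for (size_t i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F)); /* trailing unit */
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[len] | cp);
    return len;
}

/* Stores the first code unit of cp and queues the rest for later calls. */
void queue_code_point(struct u8_state *st, char32_t cp, unsigned char *pc8) {
    unsigned char units[4];
    size_t len = encode_utf8(cp, units);
    if (pc8 != NULL)
        *pc8 = units[0];
    st->count = len - 1;
    st->next = 0;
    memcpy(st->pending, units + 1, st->count);
}

size_t sketch_mbrtoc8(unsigned char *pc8, const char *s, size_t n,
                      struct u8_state *st) {
    if (s == NULL) { /* equivalent to the call with (NULL, "", 1, ps) */
        pc8 = NULL;
        s = "";
        n = 1;
    }
    if (st->next < st->count) { /* hand out a queued unit, consume nothing */
        if (pc8 != NULL)
            *pc8 = st->pending[st->next];
        st->next++;
        return (size_t)-3;
    }
    char32_t cp;
    size_t r = mbrtoc32(&cp, s, n, &st->mb);
    if (r == (size_t)-1 || r == (size_t)-2)
        return r; /* encoding error or incomplete input: nothing stored */
    queue_code_point(st, cp, pc8);
    return r;
}

/* Converts the multibyte string src to at most cap UTF-8 code units.
   Returns the number of units written, or (size_t)-1 on error. */
size_t to_utf8(const char *src, unsigned char *dst, size_t cap) {
    struct u8_state st;
    memset(&st, 0, sizeof st); /* initial conversion state */
    size_t remaining = strlen(src) + 1, written = 0;
    while (written < cap) {
        size_t r = sketch_mbrtoc8(&dst[written], src, remaining, &st);
        if (r == 0)
            break; /* terminating null stored, not counted */
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;
        written++;
        if (r != (size_t)-3) { /* input was consumed by this call */
            src += r;
            remaining -= r;
        }
    }
    return written;
}
```

<p>
A caller advances its input pointer only when the return value is a byte
count; a return of <tt>(size_t)</tt> (-3) signals that another code unit of
the same character was stored and that the same input should be passed again.
</p>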

<p>Insert another new subclause before 7.28.1.1 (The mbrtoc16 function):
<blockquote class="stdins">
7.28.1.2  <strong>The c8rtomb function</strong>
</blockquote>
</p>

<p>Add a new paragraph 1:
<blockquote class="stdins">
<strong>Synopsis</strong><br/>
<blockquote>
<div style="margin-left: 1em;">
<tt>#include</tt> &lt;uchar.h&gt;<br/>
<tt>size_t</tt> c8rtomb(<tt>char</tt> * <tt>restrict</tt> s, <tt>char8_t</tt> c8,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>mbstate_t</tt> * <tt>restrict</tt> ps);
</div>
</blockquote>
</blockquote>
</p>

<p>Add a new paragraph 2:
<blockquote class="stdins">
<strong>Description</strong><br/>
If <tt>s</tt> is a null pointer, the <tt>c8rtomb</tt> function is equivalent to the call
<blockquote>
<div style="margin-left: 2em;">
c8rtomb(buf, u8'\0', ps)
</div>
</blockquote>
where <tt>buf</tt> is an internal buffer.
</blockquote>
</p>

<p>Add a new paragraph 3:
<blockquote class="stdins">
If <tt>s</tt> is not a null pointer, the <tt>c8rtomb</tt> function determines
the number of bytes needed to represent the multibyte character that corresponds
to the character given or completed by <tt>c8</tt> (including any shift
sequences), and stores the multibyte character representation in the array whose
first element is pointed to by <tt>s</tt>, or stores nothing if <tt>c8</tt> does
not represent a complete character.  At most <tt>MB_CUR_MAX</tt> bytes are
stored. If <tt>c8</tt> is a null character, a null byte is stored, preceded
by any shift sequence needed to restore the initial shift state; the resulting
state described is the initial conversion state.
</blockquote>
</p>

<p>Add a new paragraph 4:
<blockquote class="stdins">
<strong>Returns</strong><br/>
The <tt>c8rtomb</tt> function returns the number of bytes stored in the array
object (including any shift sequences). When <tt>c8</tt> is not a valid
character, an encoding error occurs: the function stores the value of the macro
<tt>EILSEQ</tt> in <tt>errno</tt> and returns <tt>(size_t)</tt> (-1); the
conversion state is unspecified.
</blockquote>
</p>
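<p>
The accumulate-then-convert behavior described above can be sketched on top
of the existing <tt>c32rtomb</tt> function.
This is an illustration under stated assumptions, not the glibc
implementation: <tt>struct c8_state</tt> and <tt>sketch_c8rtomb</tt> are
hypothetical names, and the validation is deliberately minimal (overlong
forms, surrogates, and values above U+10FFFF are not rejected here).
</p>

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical state for this sketch: the standard conversion state plus
   the partially decoded code point. */
struct c8_state {
    mbstate_t mb;
    char32_t  partial; /* bits accumulated so far */
    unsigned  needed;  /* trailing code units still expected */
};

size_t sketch_c8rtomb(char *s, unsigned char c8, struct c8_state *st) {
    char buf[MB_LEN_MAX];
    if (s == NULL) { /* equivalent to the call with (buf, u8'\0', ps) */
        s = buf;
        c8 = 0;
    }
    if (st->needed == 0) {
        if (c8 < 0x80) {
            st->partial = c8; /* single code unit: complete character */
        } else if (c8 >= 0xC2 && c8 <= 0xDF) {
            st->partial = c8 & 0x1F; st->needed = 1;
            return 0; /* incomplete character: nothing stored */
        } else if (c8 >= 0xE0 && c8 <= 0xEF) {
            st->partial = c8 & 0x0F; st->needed = 2;
            return 0;
        } else if (c8 >= 0xF0 && c8 <= 0xF4) {
            st->partial = c8 & 0x07; st->needed = 3;
            return 0;
        } else {
            errno = EILSEQ; /* not a valid leading code unit */
            return (size_t)-1;
        }
    } else {
        if ((c8 & 0xC0) != 0x80) { /* expected a trailing code unit */
            st->needed = 0;
            errno = EILSEQ;
            return (size_t)-1;
        }
        st->partial = (st->partial << 6) | (c8 & 0x3F);
        if (--st->needed != 0)
            return 0; /* still incomplete: nothing stored */
    }
    /* The character is complete; convert it to the multibyte encoding. */
    return c32rtomb(s, st->partial, &st->mb);
}
```

<p>
A return value of 0 for a leading or intermediate code unit reflects the
"stores nothing" case; only the call that completes the character produces
output.
</p>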

<p>Change in B.16 (Atomics &lt;stdatomic.h&gt;):
<blockquote>
[&hellip;]<br/>
<tt>ATOMIC_CHAR_LOCK_FREE</tt><br/>
<ins><tt>ATOMIC_CHAR8_T_LOCK_FREE</tt></ins><br/>
<tt>ATOMIC_CHAR16_T_LOCK_FREE</tt><br/>
<tt>ATOMIC_CHAR32_T_LOCK_FREE</tt><br/>
<tt>ATOMIC_WCHAR_T_LOCK_FREE</tt><br/>
[&hellip;]<br/>
<tt>atomic_ullong</tt><br/>
<ins><tt>atomic_char8_t</tt></ins><br/>
<tt>atomic_char16_t</tt><br/>
<tt>atomic_char32_t</tt><br/>
<tt>atomic_wchar_t</tt><br/>
[&hellip;]<br/>
</blockquote>
</p>

<p>Change in B.27 (Unicode utilities &lt;uchar.h&gt;):
<blockquote>
<table>
  <tr>
    <td><tt>mbstate_t</tt></td>
    <td><tt>size_t</tt></td>
    <td><tt><ins>char8_t</ins></tt></td>
    <td><tt>char16_t</tt></td>
    <td><tt>char32_t</tt></td>
  </tr>
</table>
<blockquote>
<div style="margin-left: 1em;">
<ins>
<tt>size_t</tt> mbrtoc8(<tt>char8_t</tt> * <tt>restrict</tt> pc8,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>const</tt> <tt>char</tt> * <tt>restrict</tt> s, <tt>size_t</tt> n,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>mbstate_t</tt> * <tt>restrict</tt> ps);<br/>
<tt>size_t</tt> c8rtomb(<tt>char</tt> * <tt>restrict</tt> s, <tt>char8_t</tt> c8,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>mbstate_t</tt> * <tt>restrict</tt> ps);<br/>
</ins>
<tt>size_t</tt> mbrtoc16(<tt>char16_t</tt> * <tt>restrict</tt> pc16,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>const</tt> <tt>char</tt> * <tt>restrict</tt> s, <tt>size_t</tt> n,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>mbstate_t</tt> * <tt>restrict</tt> ps);<br/>
<tt>size_t</tt> c16rtomb(<tt>char</tt> * <tt>restrict</tt> s, <tt>char16_t</tt> c16,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>mbstate_t</tt> * <tt>restrict</tt> ps);<br/>
<tt>size_t</tt> mbrtoc32(<tt>char32_t</tt> * <tt>restrict</tt> pc32,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>const</tt> <tt>char</tt> * <tt>restrict</tt> s, <tt>size_t</tt> n,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>mbstate_t</tt> * <tt>restrict</tt> ps);<br/>
<tt>size_t</tt> c32rtomb(<tt>char</tt> * <tt>restrict</tt> s, <tt>char32_t</tt> c32,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<tt>mbstate_t</tt> * <tt>restrict</tt> ps);
</div>
</blockquote>
</blockquote>
</p>

<p>Change in J.6.1 (Rule based identifiers) paragraph 2:
<blockquote>
The following
<strong><em style="background-color: yellow">** count **</em></strong>
 identifiers or keywords match these patterns and have
particular semantics provided by this document.<br/>
<br/>
[&hellip;]<br/>
<tt>atomic_char</tt><br/>
<ins><tt>atomic_char8_t</tt></ins><br/>
<ins><tt>ATOMIC_CHAR8_T_LOCK_FREE</tt></ins><br/>
<tt>atomic_char16_t</tt><br/>
<tt>ATOMIC_CHAR16_T_LOCK_FREE</tt><br/>
[&hellip;]<br/>
</blockquote>
</p>

<p>Change in J.6.2 (Particular identifiers or keywords) paragraph 1:
<blockquote>
The following
<strong><em style="background-color: yellow">** count **</em></strong>
identifiers or keywords are not covered by the above and
have particular semantics provided by this document.<br/>
<br/>
[&hellip;]<br/>
<tt>char</tt><br/>
<ins><tt>char8_t</tt></ins><br/>
<tt>char16_t</tt><br/>
<tt>char32_t</tt><br/>
[&hellip;]<br/>
<tt>BUFSIZ</tt><br/>
<ins><tt>c8rtomb</tt></ins><br/>
<tt>c16rtomb</tt><br/>
<tt>c32rtomb</tt><br/>
[&hellip;]<br/>
<tt>mbrlen</tt><br/>
<ins><tt>mbrtoc8</tt></ins><br/>
<tt>mbrtoc16</tt><br/>
<tt>mbrtoc32</tt><br/>
[&hellip;]<br/>
</blockquote>
</p>

<h1 id="acknowledgements">Acknowledgements</h1>

<p>
Thank you to Aaron Ballman for his kind assistance facilitating interaction
with WG14.
</p>

<p>
Thank you to Richard Smith and Jens Maurer for review feedback and many
educational and helpful conversations.
</p>


<h1 id="references">References</h1>

<table id="references">
  <tr>
    <td id="ref_w3techs"><sup>[W3Techs]</sup></td>
    <td>
      "Usage of UTF-8 for websites",
      W3Techs,
      2021.<br/>
      <a href="https://w3techs.com/technologies/details/en-utf8/all/all">
      https://w3techs.com/technologies/details/en-utf8/all/all</a></td>
  </tr>
  <tr>
    <td id="ref_wg14_n2596"><sup>[WG14 N2596]</sup></td>
    <td>
      JeanHeyd Meneide, Freek Wiedijk, et al.,
      "C2x Working Draft",
      WG14 N2596,
      2020.<br/>
      <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2596.pdf">
      http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2596.pdf</a></td>
  </tr>
  <tr>
    <td id="ref_wg14_n2620"><sup>[WG14 N2620]</sup></td>
    <td>
      JeanHeyd Meneide,
      "Restartable and Non-Restartable Functions for Efficient Character Conversions | r4",
      WG14 N2620,
      2020.<br/>
      <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2620.htm">
      http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2620.htm</a></td>
  </tr>
  <tr>
    <td id="ref_wg14_n2654"><sup>[WG14 N2654]</sup></td>
    <td>
      Jens Gustedt,
      "Revise spelling of keywords v5",
      WG14 N2654,
      2021.<br/>
      <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2654.pdf">
      http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2654.pdf</a></td>
  </tr>
  <tr>
    <td id="ref_wg14_n2724"><sup>[WG14 N2724]</sup></td>
    <td>
      JeanHeyd Meneide,
      "Not-So-Magic - typeof(...) in C | r3",
      WG14 N2724,
      2021.<br/>
      <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2724.htm">
      http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2724.htm</a></td>
  </tr>
  <tr>
    <td id="ref_wg14_n2734"><sup>[WG14 N2734]</sup></td>
    <td>
      Jens Gustedt,
      "Improve type generic programming",
      WG14 N2734,
      2021.<br/>
      <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2734.pdf">
      http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2734.pdf</a></td>
  </tr>
  <tr>
    <td id="ref_wg14_n2735"><sup>[WG14 N2735]</sup></td>
    <td>
      Jens Gustedt,
      "Type inference for variable definitions and function returns",
      WG14 N2735,
      2021.<br/>
      <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2735.pdf">
      http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2735.pdf</a></td>
  </tr>
  <tr>
    <td id="ref_wg14_n2738"><sup>[WG14 N2738]</sup></td>
    <td>
      Jens Gustedt,
      "Type-generic lambdas",
      WG14 N2738,
      2021.<br/>
      <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2738.pdf">
      http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2738.pdf</a></td>
  </tr>
  <tr>
    <td id="ref_wg21_n2249"><sup>[WG21 N2249]</sup></td>
    <td>
      Lawrence Crowl,
      "New Character Types in C++",
      WG21 N2249,
      2007.<br/>
      <a href="https://wg21.link/n2249">
      https://wg21.link/n2249</a></td>
  </tr>
  <tr>
    <td id="ref_wg21_p0482r6"><sup>[WG21 P0482R6]</sup></td>
    <td>
      Tom Honermann,
      "char8_t: A type for UTF-8 characters and strings (Revision 6)",
      WG21 P0482R6,
      2018.<br/>
      <a href="https://wg21.link/p0482r6">
      https://wg21.link/p0482r6</a></td>
  </tr>
  <tr>
    <td id="ref_clang_overloadable"><sup>[Clang overloadable]</sup></td>
    <td>
      The Clang Team,
      "Clang 11 documentation, Attributes in Clang",
      2020.<br/>
      <a href="https://releases.llvm.org/11.0.0/tools/clang/docs/AttributeReference.html#overloadable">
      https://releases.llvm.org/11.0.0/tools/clang/docs/AttributeReference.html#overloadable</a></td>
  </tr>
</table>

</body>