<!doctype html public "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<head>
<title>char8_t: A type for UTF-8 characters and strings (Revision 1)</title>
<style type="text/css">
pre {
display: inline;
}
table#header th,
table#header td
{
text-align: left;
}
table#references th,
table#references td
{
vertical-align: top;
}
ins, ins * { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
del, del * { text-decoration:line-through; background-color:#FFA0A0 }
#hidedel:checked ~ * del, #hidedel:checked ~ * del * { display:none; visibility:hidden }
blockquote
{
color: #000000;
background-color: #F1F1F1;
border: 1px solid #D1D1D1;
padding-left: 0.5em;
padding-right: 0.5em;
}
blockquote.stdins
{
text-decoration: underline;
color: #000000;
background-color: #C8FFC8;
border: 1px solid #B3EBB3;
padding: 0.5em;
}
blockquote.stddel
{
text-decoration: line-through;
color: #000000;
background-color: #FFEBFF;
border: 1px solid #ECD7EC;
padding-left: 0.5em;
padding-right: 0.5em;
}
blockquote.quote
{
margin-top: 0em;
margin-left: 0em;
border-style: solid;
background-color: lemonchiffon;
color: #000000;
border: 1px solid black;
}
div.compare {
padding-left: 40px;
display: table; /* undo float:left effect */
}
div.compare_item {
float: left;
margin: 2px;
}
</style>
</head>
<body>
<table id="header">
<tr>
<th>Proposal for C2x</th>
</tr>
<tr>
<th>WG14 N2653</th>
</tr>
<tr>
<th/>
</tr>
<tr>
<th>Title:</th>
<td>char8_t: A type for UTF-8 characters and strings (Revision 1)</td>
</tr>
<tr>
<th>Revises:</th>
<td><a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm">N2231</a></td>
</tr>
<tr>
<th>Author:</th>
<td>Tom Honermann &lt;[email protected]&gt;</td>
</tr>
<tr>
<th>Date:</th>
<td>2021-06-04</td>
</tr>
<tr>
<th>Proposal category:</th>
<td>New features, change to existing features</td>
</tr>
<tr>
<th>Target audience:</th>
<td>Developers working on combined C and C++ code bases</td>
</tr>
</table>
<p>
<strong>Abstract:</strong> C++20, through the adoption of
<a title="[WG21 P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
href="https://wg21.link/p0482r6">
WG21 P0482R6</a>
<sup><a title="[WG21 P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
href="#ref_wg21_p0482r6">
[WG21 P0482R6]</a></sup>,
added a new <tt>char8_t</tt> fundamental type, changed the character
type of <tt>u8</tt> character and string literals from <tt>char</tt> to
<tt>char8_t</tt>, and added the <tt>c8rtomb()</tt> and <tt>mbrtoc8()</tt>
functions for conversion between multibyte characters and UTF-8.
This paper proposes corresponding changes for C to add a <tt>char8_t</tt>
typedef name with type <tt>unsigned char</tt>, to change the array element
type of <tt>u8</tt> string literals from <tt>char</tt> to <tt>unsigned char</tt>
(<tt>u8</tt> character literals already have type <tt>unsigned char</tt>),
and to add the <tt>c8rtomb()</tt> and <tt>mbrtoc8()</tt> functions.
These changes are intended to maintain compatibility between C and C++ and
to improve portable support for UTF-8.
</p>
<ul>
<li><a href="#changes_since_n2231">
Changes since N2231</a></li>
<li><a href="#introduction">
Introduction</a></li>
<li><a href="#motivation">
Motivation</a></li>
<li><a href="#design_options">
Design Options</a>
<ul>
<li><a href="#do_char8_t_type">
The <tt>char8_t</tt> type: typedef name vs a new integer type</a></li>
<li><a href="#do_char8_t_underlying_type">
The underlying type of <tt>char8_t</tt></a></li>
<li><a href="#do_u8_string_lit_type">
UTF-8 string literal type</a></li>
<li><a href="#do_char_array_init">
<tt>char</tt> array initialization by a UTF-8 string literal</a></li>
</ul>
</li>
<li><a href="#proposal">
Proposal</a></li>
<li><a href="#backward_compat">
Backward Compatibility</a>
<ul>
<li><a href="#bc_pointer_conversion">
Pointer conversion from a UTF-8 string literal</a></li>
<li><a href="#bc_string_lit_element_value">
The value of a UTF-8 string literal element</a></li>
<li><a href="#bc_type_inference">
Type inference</a></li>
</ul>
</li>
<li><a href="#implementation_exp">
Implementation Experience</a></li>
<li><a href="#wording">
Formal Wording</a></li>
<li><a href="#acknowledgements">
Acknowledgements</a></li>
<li><a href="#references">
References</a></li>
</ul>
<h1 id="changes_since_n2231">Changes since <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm">N2231</a></h1>
<ul>
<li>Proposal changes:
<ul>
<li>Rebased the proposed wording on
<a title="[WG14 N2596]: C2x Working Draft"
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2596.pdf">
WG14 N2596</a>
<sup><a title="[WG14 N2596]: C2x Working Draft"
href="#ref_wg14_n2596">
[WG14 N2596]</a></sup></li>
<li>Updated wording to address <tt>u8</tt> character literals and removed
references to
<a title="[WG14 N2198]: Adding the u8 character prefix"
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2198.pdf">
WG14 N2198</a>
since it has been incorporated in the working draft.</li>
<li>Removed drafting notes regarding
<a title="[WG14 DR 488]: c16rtomb() on wide characters encoded as multiple char16_t"
href="http://www.open-std.org/jtc1/sc22/WG14/www/docs/summary.htm#dr_488">
WG14 DR 488</a>
since its resolution has now been incorporated in the working draft.</li>
<li>Removed the previously proposed change to disallow initialization of
an array of type <tt>char</tt> or <tt>signed char</tt> by a UTF-8
string literal.</li>
<li>Removed the previously proposed <tt>__STDC_UTF_8__</tt> macro since UTF-8
character and string literals and the <tt>char8_t</tt> type are intended
only for use with UTF-8.</li>
</ul>
</li>
<li>Other changes:
<ul>
<li>Rewrote the abstract to reflect that
<a title="[WG21 P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
href="https://wg21.link/p0482r6">
WG21 P0482R6</a>
<sup><a title="[WG21 P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
href="#ref_wg21_p0482r6">
[WG21 P0482R6]</a></sup>
was adopted for C++20.</li>
<li>Rewrote the <a href="#motivation">Motivation</a> section.</li>
<li>Added the <a href="#design_options">Design Options</a> section.</li>
<li>Expanded the <a href="#backward_compat">Backward Compatibility</a> section.</li>
<li>Updated the <a href="#implementation_exp">Implementation Experience</a>
section with links to completed implementations in gcc and glibc.</li>
<li>Removed use of highlight.js for code highlighting purposes.</li>
</ul>
</li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>C11 introduced support for UTF-8, 16-bit, and 32-bit encoded string
literals.
New <tt>char16_t</tt> and <tt>char32_t</tt> typedef names were added to hold
values of code units for the 16-bit and 32-bit variants, but a new type
or typedef name was not added for the UTF-8 variant.
Instead, UTF-8 string literals were specified with the same type used for
ordinary string literals: array of <tt>char</tt>.
UTF-8 is the only character encoding mandated to be supported by the C
standard for which the standard does not provide a distinctly named code
unit type.
</p>
<p>Whether <tt>char</tt> is a signed or unsigned type is
implementation-defined.
Implementations that use an 8-bit signed char are at a disadvantage with
respect to working with UTF-8 encoded text since the value range of their
implementation of char does not extend to the full range of UTF-8 code unit
values; programmers working with such implementations must inject casts
to unsigned char for portable code to correctly process lead and continuation
code unit values.
</p>
<p>The lack of a distinct type and the use of a code unit type with a range
that does not portably include the full unsigned range of UTF-8 code units
present challenges for working with UTF-8 encoded text that are not present
when working with UTF-16 or UTF-32 encoded text.
Enclosed is a proposal for a new <tt>char8_t</tt> typedef and related language
and library enhancements intended to better facilitate portable handling of
UTF-8 encoded text and to enable working with all five of the standard
mandated character encodings in a consistent manner.
</p>
<h1 id="motivation">Motivation</h1>
<p>
As of February 2021,
<a title="Usage of UTF-8 for websites"
href="https://w3techs.com/technologies/details/en-utf8/all/all">
UTF-8 is now used by more than 96% of all websites</a>
<sup><a title="Usage of UTF-8 for websites"
href="#ref_w3techs">
[W3Techs]</a></sup>.
While UTF-8 now dominates websites, it has not attained similar adoption
rates in the execution environments of C and C++ programs.
Microsoft has introduced several ways in which a program can opt in to the use of
UTF-8 as the Active Code Page (ACP) starting with the April 2018 update of
Windows 10, but, by default, the ACP remains dependent on region settings.
Most POSIX systems, including Linux and macOS, use UTF-8 as the system
encoding by default, but continue to support changing the execution environment
encoding via locale related environment variables like <tt>LC_ALL</tt>.
Systems built on EBCDIC, like IBM's z/OS, continue to remain significant players
in the C and C++ ecosystems.
</p>
<p>
Programs that consume or produce UTF-8 text and text for which the encoding is
dependent on the execution environment must choose one of a few approaches to
manage text represented in these potentially distinct encodings:
<ul>
<li>Use <tt>char</tt> for all text and meticulously track which encoding
is to be used at all times.</li>
<li>Use <tt>char</tt> for all text, but meticulously convert to or from UTF-8
when interacting with the environment so that text is always represented
as UTF-8 within a component.</li>
<li>Use <tt>char</tt> when working with text for the execution environment,
and a different type, generally <tt>unsigned char</tt>, for UTF-8
encoded text.</li>
</ul>
</p>
<p>
The challenge with the first two approaches is ensuring that text is
appropriately tagged and converted as it flows through the program.
Since the same type, <tt>char</tt>, is used as the code unit type for all
text, the programmer is unable to rely on the type system to help identify
when text has not been appropriately converted.
</p>
<p>
The challenge with the third approach is the lack of a common type that
unambiguously denotes UTF-8 text across components.
Within a program, even if there is agreement on an alternate type to use,
UTF-8 string literals still have type array of <tt>char</tt>, not the
agreed upon type.
</p>
<p>
The adoption of a <tt>char8_t</tt> type via
<a title="[P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
href="https://wg21.link/p0482r6">
P0482R6</a>
<sup><a title="[P0482R6]: char8_t: A type for UTF-8 characters and strings (Revision 6)"
href="#ref_wg21_p0482r6">
[P0482R6]</a></sup>
for C++20 provided a common type tailored for use with UTF-8 text.
Adoption of a similar type for C would facilitate source code compatibility
between C and C++20, establish a standard common type for programmers that
prefer the third approach above, and provide consistent behavior across
implementations without the difficulties imposed by the implementation-defined
signedness of <tt>char</tt>.
</p>
<p>
Consider the following function that purports to check whether a pointer
points to UTF-8 text that begins with a UTF-8 leading byte.
UTF-8 leading bytes have values in the range 192 (0xC0) to 255 (0xFF) though
not all values in that range may appear in valid UTF-8 encoded text.
<blockquote><pre>
bool starts_with_utf8_leading_byte(const char *s) {
return *s >= 0xC0;
}
</pre></blockquote>
For implementations that define <tt>char</tt> as either an unsigned type or
with a size greater than 8 bits, this function will correctly classify its
inputs (assuming no invalid values).
However, for implementations that define <tt>char</tt> as a signed 8-bit type
with a two's complement representation and a range of -128 (-0x80) to
127 (0x7F), the values of UTF-8 leading bytes become negative values with
the result that this function always returns false.
For the function to behave consistently across implementations, it must be
modified to ensure the comparison is performed with an unsigned type.
<blockquote><pre>
bool starts_with_utf8_leading_byte(const char *s) {
return (unsigned char)*s >= 0xC0;
}
</pre></blockquote>
The introduction of a <tt>char8_t</tt> type that behaves as an unsigned type
would allow the function to be simply implemented as follows such that it
behaves the same for all C and C++20 implementations.
<blockquote><pre>
bool starts_with_utf8_leading_byte(const char8_t *s) {
return *s >= 0xC0;
}
</pre></blockquote>
</p>
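<p>
As a further illustration of why the comparison must be performed on an
unsigned value, the following sketch classifies a lead code unit by the
length of the sequence it introduces.
The function name <tt>utf8_sequence_length()</tt> and its return convention
are assumptions made for this example and are not part of this proposal.
The sketch inspects only the first code unit and does not validate the
code units that follow it.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the proposal).
 * Returns the length, in code units, of the UTF-8 sequence introduced by
 * the first code unit of s, or 0 if that code unit cannot begin a
 * well-formed sequence.  Only the first code unit is inspected. */
int utf8_sequence_length(const char *s) {
    assert(s != NULL);
    /* Convert to unsigned char so that the range checks below behave the
     * same whether plain char is signed or unsigned. */
    unsigned char lead = (unsigned char)*s;
    if (lead <= 0x7F) return 1;                 /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* two-unit sequence */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* three-unit sequence */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* four-unit sequence */
    /* Continuation units (0x80-0xBF) and the values that never occur in
     * well-formed UTF-8 (0xC0, 0xC1, 0xF5-0xFF). */
    return 0;
}
```

<p>
Without the conversion to <tt>unsigned char</tt>, every branch after the first
would be unreachable on implementations with an 8-bit signed <tt>char</tt>.
</p>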
<p>
Functions like the <tt>starts_with_utf8_leading_byte()</tt> example above
are not frequently written and the problem exhibited can be easily
discovered and corrected during testing.
However, more insidious problems may be encountered in other cases, such as
with the <tt>&lt;ctype.h&gt;</tt> character classification functions.
Consider the following code that naively attempts to convert its input to
uppercase using <tt>toupper()</tt>.
<blockquote><pre>
void convert_to_uppercase(char *p) {
for (; *p; ++p) {
*p = toupper(*p);
}
}
</pre></blockquote>
</p>
<p>
When called with a UTF-8 encoded string that contains non-ASCII characters,
this function encounters undefined behavior for implementations with an 8-bit
signed <tt>char</tt> type, even when the current locale is UTF-8-based.
The problem is that lead and continuation UTF-8 code unit values are negative
for such implementations and may result in a sign extended negative value
(that does not match <tt>EOF</tt>) being passed to <tt>toupper()</tt>.
The result is undefined behavior according to
C17 7.4, "Character handling &lt;ctype.h&gt;", paragraph 1:
<div style="margin-left: 1em;">
<blockquote class="quote">
The header <tt>&lt;ctype.h&gt;</tt> declares several functions useful for
classifying and mapping characters.<sup>202)</sup>
In all cases the argument is an <tt>int</tt>, the value of which shall be
representable as an <tt>unsigned char</tt> or shall equal the value of the
macro <tt>EOF</tt>.
If the argument has any other value, the behavior is undefined.
</blockquote>
</div>
For this code to portably work as intended, the argument to <tt>toupper()</tt>
must be cast to <tt>unsigned char</tt>.
Alternatively, changing the type of the <tt>convert_to_uppercase()</tt>
parameter to the proposed <tt>char8_t</tt> type would portably correct the
code while also signifying that the intended input is UTF-8.
</p>
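<p>
A minimal sketch of the cast-based correction described above follows.
It keeps the <tt>char *</tt> parameter and converts each code unit to
<tt>unsigned char</tt> before calling <tt>toupper()</tt>.
The sketch assumes the default "C" locale, in which only the letters
<tt>a</tt> through <tt>z</tt> are mapped, so UTF-8 lead and continuation
code units pass through unchanged; behavior in other locales depends on the
implementation.
</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Variant of convert_to_uppercase() that avoids the undefined behavior:
 * the argument passed to toupper() is always representable as an
 * unsigned char. */
void convert_to_uppercase(char *p) {
    assert(p != NULL);
    for (; *p; ++p) {
        /* The cast back to char mirrors the implicit conversion performed
         * by the assignment in the original example. */
        *p = (char)toupper((unsigned char)*p);
    }
}
```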
<h1 id="design_options">Design Options</h1>
<h2 id="do_char8_t_type">The <tt>char8_t</tt> type: typedef name vs a new integer type</h2>
<p>
When the <tt>char16_t</tt> and <tt>char32_t</tt> types were introduced in C11
and C++11, a choice was faced whether to introduce them as typedef names of
existing types or as new integer types.
The WG14 and WG21 committees chose different directions; WG14 opted for
typedef names for C and WG21 opted for new integer types for C++.
Each choice was consistent with that committee's prior choice regarding the
<tt>wchar_t</tt> type.
The same choice applies for the introduction of a <tt>char8_t</tt> type.
</p>
<p>
The <tt>char16_t</tt> and <tt>char32_t</tt> types were added to C++11 by the
adoption of
<a title="[WG21 N2249]: New Character Types in C++"
href="https://wg21.link/n2249">
WG21 N2249</a>
<sup><a title="[WG21 N2249]: New Character Types in C++"
href="#ref_wg21_n2249">
[WG21 N2249]</a></sup>.
The motivation for new integer types stated in that proposal includes the
ability to support function overloading and template specialization; abilities
that would not be possible, at least not reliably and portably, if the new
types were simply typedef names of existing types.
At the time these types were adopted, C did not yet have support for generic
programming; the <tt>_Generic</tt> generic selection expression had not yet
been adopted.
Thus, there was little to no motivation for WG14 to impose the additional effort
required to support new integer types on implementors.
</p>
<p>
WG14 now has several proposals to improve support for generic programming in C:
<ul>
<li><a title="WG14 N2734: Improve type generic programming"
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2734.pdf">
WG14 N2734: Improve type generic programming</a>
<sup><a title="[WG14 N2734]: Improve type generic programming"
href="#ref_wg14_n2734">
[WG14 N2734]</a></sup></li>
<li><a title="WG14 N2724: Not-So-Magic - typeof(...) in C | r3"
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2724.htm">
WG14 N2724: Not-So-Magic - typeof(...) in C | r3</a>
<sup><a title="[WG14 N2724]: Not-So-Magic - typeof(...) in C | r3"
href="#ref_wg14_n2724">
[WG14 N2724]</a></sup></li>
<li><a title="WG14 N2738: Type-generic lambdas"
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2738.pdf">
WG14 N2738: Type-generic lambdas</a>
<sup><a title="[WG14 N2738]: Type-generic lambdas"
href="#ref_wg14_n2738">
[WG14 N2738]</a></sup></li>
<li><a title="WG14 N2735: Type inference for variable definitions and function returns"
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2735.pdf">
WG14 N2735: Type inference for variable definitions and function returns</a>
<sup><a title="[WG14 N2735]: Type inference for variable definitions and function returns"
href="#ref_wg14_n2735">
[WG14 N2735]</a></sup></li>
</ul>
Desire for generic programming improvements may translate to additional
motivation for distinct integer types for character data.
The following example illustrates a potential use case that would be enabled
by distinct types.
<blockquote><pre>
void send_narrow(const char*);
void send_wide(const wchar_t*);
void send_utf8(const char8_t*);
void send_utf16(const char16_t*);
void send_utf32(const char32_t*);
#define send(X) \
_Generic((X), \
char*: send_narrow, \
wchar_t*: send_wide, \
char8_t*: send_utf8, \
char16_t*: send_utf16, \
char32_t*: send_utf32)(X)
void f() {
send(L"text"); /* Would be ok with distinct types; calls send_wide(). */
send(u8"text"); /* Would be ok with distinct types; calls send_utf8(). */
}
</pre></blockquote>
</p>
<p>
Clang supports an
<a title="Clang 11 documentation, Attributes in Clang"
href="https://releases.llvm.org/11.0.0/tools/clang/docs/AttributeReference.html#overloadable">
extension that enables overloading in C</a>
<sup><a title="Clang 11 documentation, Attributes in Clang"
href="#ref_clang_overloadable">
[Clang overloadable]</a></sup>.
If such an extension were adopted by WG14, the code above could be more simply written as:
<blockquote><pre>
void __attribute__((overloadable)) send(const char*);
void __attribute__((overloadable)) send(const wchar_t*);
void __attribute__((overloadable)) send(const char8_t*);
void __attribute__((overloadable)) send(const char16_t*);
void __attribute__((overloadable)) send(const char32_t*);
void f() {
send(L"text"); /* Would be ok with distinct types; calls send(const wchar_t*). */
send(u8"text"); /* Would be ok with distinct types; calls send(const char8_t*). */
}
</pre></blockquote>
</p>
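<p>
By contrast with the distinct-type examples above, the following sketch shows
what remains possible with a typedef name.
Because <tt>char</tt>, <tt>signed char</tt>, and <tt>unsigned char</tt> are
three distinct types, a typedef name for <tt>unsigned char</tt> can still be
distinguished from <tt>char</tt> in a generic selection; it cannot be
distinguished from <tt>unsigned char</tt> itself.
The names <tt>utf8_unit</tt>, <tt>kind_of()</tt>, and <tt>text_kind</tt> are
hypothetical and are used here only to avoid depending on an implementation
that already provides <tt>char8_t</tt>.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed char8_t typedef name. */
typedef unsigned char utf8_unit;

enum text_kind { TEXT_NARROW, TEXT_UTF8 };

enum text_kind kind_of_narrow(const char *s) {
    assert(s != NULL);
    return TEXT_NARROW;
}

enum text_kind kind_of_utf8(const utf8_unit *s) {
    assert(s != NULL);
    return TEXT_UTF8;
}

/* Selects a function based on the pointed-to character type.  The four
 * associations name four distinct types, so no two of them conflict. */
#define kind_of(X)                        \
    _Generic((X),                         \
        const char *: kind_of_narrow,     \
        char *: kind_of_narrow,           \
        const utf8_unit *: kind_of_utf8,  \
        utf8_unit *: kind_of_utf8)(X)
```

<p>
Adding an association for <tt>unsigned char *</tt> or, on implementations
where it names the same type, <tt>uint8_t *</tt> would be rejected as a
duplicate of the <tt>utf8_unit *</tt> association.
</p>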
<p>
Additional motivation for distinct integer types is the ability to specify
them as non-aliasing types.
A non-aliasing type is one for which objects of the type may only be accessed
using a limited set of types; compatible types and specially designated types
like <tt>char</tt> and <tt>unsigned char</tt>.
Compilers may use type based alias analysis (TBAA) to generate more efficient
code for non-aliasing types.
Aliasing violations result in undefined behavior.
</p>
<p>
The following example code would be well-formed in C regardless of whether
<tt>char8_t</tt> is specified as a new integer type or as a typedef name of
an existing character type.
If <tt>char8_t</tt> is specified as a typedef name of an existing character
type, then the example also works as expected because it does not violate
aliasing rules.
However, if <tt>char8_t</tt> is specified as a new integer type, then the
example would exhibit undefined behavior because an object of type
<tt>char</tt> is accessed using the <tt>char8_t</tt> type (assuming no new
special provisions added to C17 6.5, Expressions, paragraph 7).
Thus, there is a trade-off between code efficiency and safety inherent in
how <tt>char8_t</tt> is defined.
<blockquote><pre>
void do_utf8_things(const char8_t *s) {
*s;
}
void f() {
const char *presumably_utf8_text = "text";
do_utf8_things(presumably_utf8_text);
}
</pre></blockquote>
</p>
<p>
Since <tt>char8_t</tt> is a distinct type in C++ and the C++ type system
prohibits implicit access to objects with an incompatible type without use of
a cast, the above example is ill-formed in C++20.
The code may be rendered well-formed in C++20 by the addition of a
cast, but it will then result in undefined behavior when executed.
<blockquote><pre>
void do_utf8_things(const char8_t *s) {
*s;
}
void f() {
const char *presumably_utf8_text = "text";
do_utf8_things((const char8_t*)presumably_utf8_text);
}
</pre></blockquote>
Such a cast might be added by a C programmer in order to silence warnings
regarding a change of signedness that might be produced when the
<tt>const char*</tt> argument to <tt>do_utf8_things()</tt> is converted to
<tt>const char8_t*</tt>; assuming <tt>char8_t</tt> is a typedef name of a
differently signed character type (otherwise, if <tt>char8_t</tt> were a
distinct type, the code would exhibit undefined behavior whether or not
the cast was present).
In that case, the unfortunate result is that the code is well-formed for
both C and C++, but exhibits undefined behavior only when compiled for C++.
</p>
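<p>
The following sketch illustrates the C side of this asymmetry under the
typedef name approach.
An object of type <tt>char</tt> is read through an lvalue of type
<tt>unsigned char</tt>, which C17 6.5 paragraph 7 permits because the access
is through a character type.
The names <tt>utf8_unit</tt>, <tt>count_code_points()</tt>, and
<tt>count_code_points_in_char_string()</tt> are hypothetical, and the
counting logic assumes well-formed UTF-8 input.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed char8_t typedef name. */
typedef unsigned char utf8_unit;

/* Counts code points by counting every code unit that is not a
 * continuation unit (0x80-0xBF).  Assumes well-formed UTF-8. */
size_t count_code_points(const utf8_unit *s) {
    assert(s != NULL);
    size_t n = 0;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

size_t count_code_points_in_char_string(const char *presumably_utf8_text) {
    /* The cast changes only the signedness of the pointed-to type.
     * Reading a char object through an unsigned char lvalue does not
     * violate the aliasing rules in C. */
    return count_code_points((const utf8_unit *)presumably_utf8_text);
}
```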
<p>
This aliasing asymmetry between C and C++ is not a new concern; it already
exists for the <tt>wchar_t</tt>, <tt>char16_t</tt>, and <tt>char32_t</tt>
types.
For example, <tt>char16_t</tt> and <tt>uint_least16_t</tt> are distinct
integer types in C++ (and do not alias), but are the same type in C.
Whether these aliasing issues are more significant for <tt>char8_t</tt> as
opposed to the other character types is a subjective concern.
</p>
<p>
Introduction of a new <tt>char8_t</tt> integer type without a corresponding
change to make <tt>wchar_t</tt>, <tt>char16_t</tt>, and <tt>char32_t</tt>
distinct integer types would be inconsistent and surprising.
While the author sees potential use for distinct types as shown above, such
a change of direction should be pursued via a separate proposal.
Should WG14 indicate support for such direction when reviewing this proposal,
the author will submit a separate proposal.
In the meantime, this proposal advocates for only a new <tt>char8_t</tt>
typedef name in order to maintain consistency with the existing character
types.
</p>
<p>
<b>Proposed: a new <tt>char8_t</tt> typedef name defined in the
<tt>&lt;uchar.h&gt;</tt> header.</b>
</p>
<h2 id="do_char8_t_underlying_type">The underlying type of <tt>char8_t</tt></h2>
<p>
UTF-8 code unit values range from 0x00 to 0xF4 (the values 0xC0, 0xC1, and 0xF5
through 0xFF do not occur in well-formed UTF-8 code unit sequences) and
therefore require at least an 8-bit type for storage.
</p>
<p>
The existing <tt>char16_t</tt> and <tt>char32_t</tt> typedef names are defined
as having the same type as <tt>uint_least16_t</tt> and <tt>uint_least32_t</tt>
respectively.
This suggests that the underlying type of <tt>char8_t</tt> should be the same
type as <tt>uint_least8_t</tt>.
However, the latitude provided for the <tt>uint_least8_t</tt> typedef name to
be defined with a type other than <tt>unsigned char</tt> provides no benefit
for the proposed <tt>char8_t</tt> type; <tt>unsigned char</tt> is already
defined to be unsigned with a size and alignment of 1 byte.
Since bytes are constrained to be at least 8 bits and no smaller types are
possible, additional leniency would only serve to limit portability.
</p>
<p>
The type of character constants with a <tt>u8</tt> <i>encoding-prefix</i> is
already <tt>unsigned char</tt>.
The underlying type for <tt>char8_t</tt> in C++20 is also
<tt>unsigned char</tt>.
For consistency with <tt>u8</tt> character constants and the C++20
<tt>char8_t</tt> type, this proposal defines the underlying type of the
proposed <tt>char8_t</tt> type to be <tt>unsigned char</tt>.
</p>
<p>
<b>Proposed: The underlying type of <tt>char8_t</tt> is
<tt>unsigned char</tt>.</b>
</p>
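<p>
The properties relied upon in this section can be checked at translation
time, as in the following sketch.
The names <tt>utf8_unit</tt> and <tt>is_possible_utf8_code_unit()</tt> are
hypothetical; the assertions restate guarantees that the C standard already
provides for <tt>unsigned char</tt> and are expected to hold on every
conforming implementation.
</p>

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the proposed char8_t typedef name. */
typedef unsigned char utf8_unit;

/* Guarantees that make unsigned char suitable for UTF-8 code units. */
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
static_assert(sizeof(utf8_unit) == 1, "a code unit occupies one byte");
static_assert(UCHAR_MAX >= 0xF4, "every UTF-8 code unit value is representable");

/* Returns nonzero if u may occur in a well-formed UTF-8 code unit
 * sequence: 0x00 through 0xF4, excluding 0xC0 and 0xC1. */
int is_possible_utf8_code_unit(utf8_unit u) {
    return u <= 0xF4 && u != 0xC0 && u != 0xC1;
}
```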
<h2 id="do_u8_string_lit_type">UTF-8 string literal type</h2>
<p>
In C17, a UTF-8 string literal has type array of <tt>char</tt>.
Since the size and signedness of <tt>char</tt> are implementation-defined,
portable code requires casts to an unsigned type when reading UTF-8
code unit values stored in objects of type <tt>char</tt>.
This is required because common implementations implement <tt>char</tt>
as a signed 8-bit type for which integer promotion rules produce a
negative value for leading and trailing code unit values (which all have
values above 0x7F).
While it is uncommon for code to directly access the elements of a string
literal, such accesses may occur when macros are involved.
</p>
<p>
In the working draft, UTF-8 character constants have a type of
<tt>unsigned char</tt>.
That results in a surprising inconsistency with UTF-8 string literals.
<blockquote><pre>
#define M(X) ((X) >= 0x80)
void f() {
M(u8"\u00E9"[0]); /* True for some implementations, false for others.
U+00E9 is encoded as 0xC3 0xA9 in UTF-8.
0xC3 will promote to a negative integer value for
implementations with a signed 8-bit char type. */
M(u8'\xC3'); /* True for all implementations. */
}
</pre></blockquote>
Changing the type of a UTF-8 string literal to an array of type <tt>char8_t</tt>
would avoid this inconsistency such that both expressions above would result in
a true value for all implementations.
</p>
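<p>
Until the literal type changes, one way to obtain consistent results is to
copy the literal into an array of <tt>unsigned char</tt>, which C17 already
allows a UTF-8 string literal to initialize.
The following sketch applies the macro <tt>M</tt> from the example above to
such an array; the names <tt>e_acute</tt> and
<tt>first_unit_is_non_ascii()</tt> are hypothetical.
</p>

```c
#include <assert.h>
#include <stddef.h>

#define M(X) ((X) >= 0x80)

/* U+00E9 is encoded as 0xC3 0xA9 in UTF-8.  The array element type is
 * unsigned char regardless of the type of the literal itself. */
const unsigned char e_acute[] = u8"\u00E9";

/* Returns nonzero if the first code unit of s is outside the ASCII range.
 * Because the element type is unsigned, the promoted value is never
 * negative. */
int first_unit_is_non_ascii(const unsigned char *s) {
    assert(s != NULL);
    return M(s[0]);
}
```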
<p>
For consistency with <tt>u8</tt> character constants and the type of C++20
UTF-8 string literals, this proposal changes the type of a UTF-8 string literal
from array of <tt>char</tt> to array of <tt>char8_t</tt>.
The <a href="#backward_compat">Backward Compatibility</a> section discusses the
impact of this change.
</p>
<p>
<b>Proposed: The type of UTF-8 string literals is changed from array of
<tt>char</tt> to array of <tt>char8_t</tt>.</b>
</p>
<h2 id="do_char_array_init"><tt>char</tt> array initialization by a UTF-8 string literal</h2>
<p>
In C17, arrays of type <tt>char</tt>, <tt>signed char</tt>, and
<tt>unsigned char</tt> may be initialized by a UTF-8 string
literal. These were all made ill-formed in C++20 where only
arrays of <tt>char8_t</tt> may be initialized by a UTF-8 string
literal.
<blockquote><pre>
const char cu8[] = u8"text"; /* Ok in C17 and C++17, ill-formed in C++20. */
const signed char scu8[] = u8"text"; /* Ok in C17 and C++17, ill-formed in C++20. */
const unsigned char ucu8[] = u8"text"; /* Ok in C17 and C++17, ill-formed in C++20. */
</pre></blockquote>
</p>
<p>
For other character types, whether an array of the character type can be
initialized by a string literal with a mismatched encoding prefix depends
on the implementation.
C17 6.7.9, "Initialization", paragraph 15 states:
<div style="margin-left: 1em;">
<blockquote class="quote">
An array with element type compatible with a qualified or unqualified version
of <tt>wchar_t</tt>, <tt>char16_t</tt>, or <tt>char32_t</tt> may be initialized
by a wide string literal with the corresponding encoding prefix (<tt>L</tt>,
<tt>u</tt>, or <tt>U</tt>, respectively), optionally enclosed in braces.
Successive wide characters of the wide string literal (including the
terminating null wide character if there is room or if the array is of unknown
size) initialize the elements of the array.
</blockquote>
</div>
C++ does not allow an array to be initialized by a string literal with a
mismatched encoding prefix.
<blockquote><pre>
const wchar_t wc16[] = u"text"; /* Ok in C17 if wchar_t and char16_t are compatible types, ill-formed in C++20. */
const wchar_t wc32[] = U"text"; /* Ok in C17 if wchar_t and char32_t are compatible types, ill-formed in C++20. */
const char16_t c16w[] = L"text"; /* Ok in C17 if wchar_t and char16_t are compatible types, ill-formed in C++20. */
const char32_t c32w[] = L"text"; /* Ok in C17 if char32_t and wchar_t are compatible types, ill-formed in C++20. */
</pre></blockquote>
</p>
<p>
Prohibiting initialization of arrays of type <tt>char</tt> and
<tt>signed char</tt> by UTF-8 string literals would improve consistency with
C++20.
However, the existing inconsistencies are fully explainable as a consequence of
the choice to use existing integer types for wide character types in C versus the
choice to introduce new integer types in C++.
If WG14 were to decide to switch to use of distinct integer types for wide
character types (and <tt>char8_t</tt>) in the future, then it would make sense
to align initialization allowances with C++.
Until then, this proposal preserves the existing ability to initialize an
array of plain <tt>char</tt> or an array of <tt>signed char</tt>
with a UTF-8 string literal.
</p>
<p>
<b>Proposed: initialization of an array of type <tt>char</tt> or an array of
type <tt>signed char</tt> by a UTF-8 string literal remains well-formed.</b>
</p>
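<p>
As a minimal sketch of what this preserves, the following initializes arrays
of all three narrow character types from the same UTF-8 string literal and
compares their contents through <tt>unsigned char</tt>.
The helper names <tt>u8_arrays_agree()</tt> and <tt>u8_code_unit()</tt> are
hypothetical and are not part of this proposal.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* U+00E9 is encoded in UTF-8 as the two code units 0xC3 0xA9, so each
   array below has three elements including the terminating null. */
const char          cu8[]  = u8"\u00E9";
const signed char   scu8[] = u8"\u00E9";
const unsigned char ucu8[] = u8"\u00E9";

/* Hypothetical helper: returns 1 if the three arrays hold the same code
   units, 0 otherwise. */
int u8_arrays_agree(void)
{
    assert(sizeof cu8 == sizeof scu8 && sizeof cu8 == sizeof ucu8);
    for (size_t i = 0; i < sizeof cu8; ++i) {
        /* Compare through unsigned char so that the result does not
           depend on whether plain char is signed. */
        if ((unsigned char)cu8[i] != ucu8[i] ||
            (unsigned char)scu8[i] != ucu8[i])
            return 0;
    }
    return 1;
}

/* Hypothetical helper: returns the code unit at index i as an unsigned
   value. */
unsigned u8_code_unit(size_t i)
{
    assert(i < sizeof ucu8);
    return ucu8[i];
}
```

<p>
The same declarations are expected to compile both in C17 and with this
proposal; only the type of the literal itself differs.
</p>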
<h1 id="proposal">Proposal</h1>
<p>The proposed changes include:
<ul>
<li>A new <tt>char8_t</tt> typedef name with type <tt>unsigned char</tt>
defined in the <tt>&lt;uchar.h&gt;</tt> header.</li>
<li>The type of UTF-8 string literals is changed from array of
<tt>char</tt> to array of <tt>char8_t</tt>.</li>
<li>The type of UTF-8 character literals is changed from
<tt>unsigned char</tt> to <tt>char8_t</tt>.<br/>
(Since UTF-8 character literals already have type <tt>unsigned char</tt>,
this is not a semantic change.)</li>
<li>Initialization of an array of type <tt>char</tt> or type
<tt>signed char</tt> by a UTF-8 string literal remains well-formed.</li>
<li>New <tt>mbrtoc8()</tt> and <tt>c8rtomb()</tt> functions declared in
<tt>&lt;uchar.h&gt;</tt> enable conversions between multibyte characters
and UTF-8.</li>
<li>A new <tt>ATOMIC_CHAR8_T_LOCK_FREE</tt> macro.</li>
<li>A new <tt>atomic_char8_t</tt> typedef name.</li>
</ul>
</p>
<h1 id="backward_compat">Backward Compatibility</h1>
<p>The proposed change to the type of UTF-8 string literals impacts backward
compatibility as described in the following sections.
Implementors are encouraged to offer options to disable <tt>char8_t</tt>
support when necessary to preserve compatibility with C17.
</p>
<h2 id="bc_pointer_conversion">Pointer conversion from a UTF-8 string literal</h2>
<p>
Initialization or assignment of <tt>char</tt> pointers (including
parameters) from UTF-8 string literals remains well-formed under this
proposal.
However, some implementations may produce warnings about differences in
signedness depending on whether <tt>char</tt> is a signed or unsigned type.
</p>
<p>
For example:
<blockquote><pre>
const char *p = u8"text"; // Well-formed in C17 and with this proposal, but
// implementations may now warn about different
// signedness for the pointer target type.
</pre></blockquote>
</p>
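<p>
Where such warnings are unwanted, one option is an explicit cast at the point
of conversion.
The sketch below assumes a hypothetical helper, <tt>u8_length()</tt>, that
accepts UTF-8 text through a <tt>const char*</tt> parameter; neither function
name comes from this proposal.
</p>

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: length in code units of a null-terminated UTF-8
   string that is passed as const char*. */
size_t u8_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Hypothetical caller. */
size_t u8_length_of_literal(void)
{
    /* The explicit cast states the intended pointer type whether the
       literal is an array of char (C17) or an array of unsigned char
       (this proposal), so no signedness diagnostic is expected. */
    const char *p = (const char *)u8"text";
    return u8_length(p);
}
```

<p>
The cast does not change the stored code units; it only changes the type
through which they are accessed.
</p>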
<h2 id="bc_string_lit_element_value">The value of a UTF-8 string literal element</h2>
<p>
Code that directly accesses the code unit values of UTF-8 string literals
without an intervening cast to an unsigned type may observe different values
under this proposal.
This will occur for implementations with a signed 8-bit <tt>char</tt> type
when accessing a leading or trailing UTF-8 code unit (such code units have a
value in the range <tt>0x80</tt> through <tt>0xFF</tt>).
</p>
<p>
For example:
<blockquote><pre>
if (u8"\u00E9"[0] < 0) {} // Well-formed with implementation-defined behavior
// in C17. Well-formed with portable behavior with
// this proposal (the conditional is always false).
</pre></blockquote>
</p>
<p>
The author is unaware of use cases that involve directly probing the values
of UTF-8 string literal elements, but such accesses may occur as a result of
macro processing.
Code intended to be portable will already contain an appropriate cast to an
unsigned type and will therefore be unaffected by this proposal.
Non-portable code that relies on leading and trailing UTF-8 code unit values
having a negative value will require modification.
</p>
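<p>
For illustration, a portable form of such a probe might look like the
following sketch, which behaves the same in C17 and with this proposal.
The function names are hypothetical.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the code unit at index i is part of a
   multi-code-unit UTF-8 sequence (value 0x80 or greater), 0 otherwise.
   The cast makes the comparison independent of the signedness of char. */
int is_multibyte_code_unit(const char *s, size_t i)
{
    assert(s != NULL);
    return (unsigned char)s[i] >= 0x80;
}

/* Hypothetical probe of a literal element. The cast applies to the
   subscripted element, so the value compared is 0xC3 in both C17 and
   with this proposal. */
int literal_lead_unit_is_c3(void)
{
    return (unsigned char)u8"\u00E9"[0] == 0xC3;
}
```

<p>
Without the cast, the comparison in <tt>is_multibyte_code_unit()</tt> would be
always false on implementations where <tt>char</tt> is a signed 8-bit type.
</p>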
<h2 id="bc_type_inference">Type inference</h2>
<p>
Code that makes use of <tt>_Generic</tt> expressions, type inference extensions
such as gcc's <tt>__typeof__</tt> type specifier, or Clang's extension for
overloading in C may become ill-formed or behave differently with this proposal.
</p>
<p>
In the following example, <tt>serialize</tt> is a type-generic macro that, based
on the type of its argument, dispatches to either <tt>serialize_text()</tt>,
<tt>serialize_wide_text()</tt>, <tt>serialize_int()</tt>, or
<tt>serialize_double()</tt>.
With this proposal, there is no longer a type match, so the code becomes
ill-formed.
This code can be corrected on the caller side by adding a cast to <tt>char*</tt>
or on the callee side by adding a type match for <tt>unsigned char*</tt>.
The latter approach has the benefit of allowing <tt>serialize</tt> to dispatch
to a <tt>serialize_u8text()</tt> function that specifically handles UTF-8
encoded text.
<blockquote><pre>
void serialize_text(const char*);
void serialize_wide_text(const wchar_t*);
void serialize_int(int);
void serialize_double(double);
#define serialize(X) _Generic((X), \
char*: serialize_text, \
wchar_t*: serialize_wide_text, \
int: serialize_int, \
double: serialize_double)(X)
void f() {
serialize(u8"text"); // Well-formed in C17, ill-formed with this proposal.
}
</pre></blockquote>
</p>
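<p>
The callee-side correction described above could be sketched as follows.
To make the selected branch observable, the functions here return a tag
instead of <tt>void</tt>; the tags, the function bodies, and
<tt>serialize_u8_literal()</tt> are assumptions made for this illustration
only.
</p>

```c
#include <assert.h>
#include <stddef.h>  /* wchar_t, NULL */

/* Hypothetical tags identifying which function was selected. */
enum { SER_TEXT = 1, SER_WIDE_TEXT, SER_U8TEXT, SER_INT, SER_DOUBLE };

int serialize_text(const char *s)            { assert(s != NULL); return SER_TEXT; }
int serialize_wide_text(const wchar_t *s)    { assert(s != NULL); return SER_WIDE_TEXT; }
int serialize_u8text(const unsigned char *s) { assert(s != NULL); return SER_U8TEXT; }
int serialize_int(int i)                     { (void)i; return SER_INT; }
int serialize_double(double d)               { (void)d; return SER_DOUBLE; }

/* The added unsigned char* association gives UTF-8 string literals a
   match once their type becomes array of unsigned char. */
#define serialize(X) _Generic((X),            \
    char*:          serialize_text,           \
    unsigned char*: serialize_u8text,         \
    wchar_t*:       serialize_wide_text,      \
    int:            serialize_int,            \
    double:         serialize_double)(X)

/* Selects serialize_text() in C17 and serialize_u8text() with this
   proposal; well-formed in both cases. */
int serialize_u8_literal(void)
{
    return serialize(u8"text");
}
```

<p>
The caller-side alternative, <tt>serialize((char*)u8"text")</tt>, selects
<tt>serialize_text()</tt> under either typing of the literal.
</p>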
<p>
The following example reimplements the serialization example, using Clang's
extension for overloading in C.
In this case, the change of type for the UTF-8 string literal results in
ambiguous overload resolution.
Here again, the code can be corrected on the caller side by adding a cast, or
can be corrected on the callee side by adding an overload for
<tt>const unsigned char*</tt>.
Again, the latter has the benefit of enabling UTF-8 encoded text to be handled
differently than text matching the execution character set.
<blockquote><pre>
void serialize(const char*) __attribute__((overloadable));
void serialize(const wchar_t*) __attribute__((overloadable));
void serialize(int) __attribute__((overloadable));
void serialize(double) __attribute__((overloadable));
void f() {
serialize(u8"text"); // Well-formed in C17 with Clang's overloading extension.
// Ill-formed with this proposal.
}
</pre></blockquote>
</p>
<h1 id="implementation_exp">Implementation Experience</h1>
<p>The proposed changes have been implemented in forks of gcc and glibc and
are available in the <tt>char8_t-for-c</tt> and <tt>char8_t</tt> branches
respectively of the following repositories:
<ul>
<li>gcc: <a href="https://github.com/tahonermann/gcc/tree/char8_t-for-c">
https://github.com/tahonermann/gcc/tree/char8_t-for-c</a></li>
<li>glibc: <a href="https://github.com/tahonermann/glibc/tree/char8_t">
https://github.com/tahonermann/glibc/tree/char8_t</a></li>
</ul>
</p>
<p>
The changes to glibc provide declarations for the <tt>char8_t</tt> typedef
name and the <tt>c8rtomb()</tt> and <tt>mbrtoc8()</tt> functions.
When compiling for C, these declarations are only present when the
<tt>_CHAR8_T_SOURCE</tt> feature test macro is defined.
</p>
<p>
The changes to gcc provide the <tt>atomic_char8_t</tt> typedef name, the
<tt>ATOMIC_CHAR8_T_LOCK_FREE</tt> macro, and the change of type for UTF-8
literals from array of <tt>char</tt> to array of <tt>unsigned char</tt>.
The existing <tt>-fchar8_t</tt> and <tt>-fno-char8_t</tt> compiler options
are extended to C code to allow opting-in or opting-out of these changes.
When <tt>-fchar8_t</tt> is enabled, the <tt>_CHAR8_T_SOURCE</tt> macro is
defined to inform the C library that the <tt>char8_t</tt> typedef name and
the <tt>c8rtomb()</tt> and <tt>mbrtoc8()</tt> declarations should be provided
by the <tt>&lt;uchar.h&gt;</tt> header.
</p>
<h1 id="wording">Formal Wording</h1>
<input type="checkbox" id="hidedel">Hide deleted text</input>
<p>These changes are relative to
<a title="[WG14 N2596]: C2x Working Draft"
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2596.pdf">
WG14 N2596</a>
<sup><a title="[WG14 N2596]: C2x Working Draft"
href="#ref_wg14_n2596">
[WG14 N2596]</a></sup>.
</p>
<p>Change in 6.4.4 (Character constants) paragraph 9:
<blockquote>
The value of an octal or hexadecimal escape sequence shall be in the range of
representable values for the corresponding type:
<div style="margin-left: 1em;">
<table style="border-collapse: collapse;">
<tr style="border-bottom: 1px solid black;">
<td style="border-right: 1px solid black;">Prefix</td>
<td>Corresponding type</td>
</tr>
<tr>
<td style="border-right: 1px solid black;">none</td>
<td><tt>unsigned char</tt></td>
</tr>
<tr>
<td style="border-right: 1px solid black;"><tt>u8</tt></td>
<td><tt><del>unsigned char</del><ins>char8_t</ins></tt></td>
</tr>
<tr>
<td style="border-right: 1px solid black;"><tt>L</tt></td>
<td>the unsigned type corresponding to <tt>wchar_t</tt></td>
</tr>
<tr>
<td style="border-right: 1px solid black;"><tt>u</tt></td>
<td><tt>char16_t</tt></td>
</tr>
<tr>
<td style="border-right: 1px solid black;"><tt>U</tt></td>
<td><tt>char32_t</tt></td>
</tr>
</table>
</div>
</blockquote>
</p>
<p>Change in 6.4.4 (Character constants) paragraph 12:
<blockquote>
A UTF-8 character constant has type
<tt><del>unsigned char</del><ins>char8_t</ins></tt>.
The value of a UTF-8 character constant is equal to its ISO/IEC 10646 code
point value, provided that the code point value can be encoded as a
single UTF-8 code unit.
</blockquote>
</p>
<p>Change in 6.4.5 (String Literals) paragraph 6:
<blockquote>