<!doctype html public "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<title>char8_t: A type for UTF-8 characters and strings</title>
<style type="text/css">
table#header th,
table#header td
{
text-align: left;
}
table#references th,
table#references td
{
vertical-align: top;
}
ins, ins * { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
del, del * { text-decoration:line-through; background-color:#FFA0A0 }
#hidedel:checked ~ * del, #hidedel:checked ~ * del * { display:none; visibility:hidden }
blockquote
{
color: #000000;
background-color: #F1F1F1;
border: 1px solid #D1D1D1;
padding-left: 0.5em;
padding-right: 0.5em;
}
blockquote.stdins
{
text-decoration: underline;
color: #000000;
background-color: #C8FFC8;
border: 1px solid #B3EBB3;
padding: 0.5em;
}
blockquote.stddel
{
text-decoration: line-through;
color: #000000;
background-color: #FFEBFF;
border: 1px solid #ECD7EC;
padding-left: 0.5em;
padding-right: 0.5em;
}
blockquote.code
{
background-color: #F1F1F1;
border: 1px solid #D1D1D1;
}
</style>
</head>
<body>
<table id="header">
<tr>
<th>Document Number:</th>
<td>P0482R0</td>
</tr>
<tr>
<th>Date:</th>
<td>2016-10-17</td>
</tr>
<tr>
<th>Audience:</th>
<td>Evolution Working Group<br/>
Library Evolution Working Group</td>
</tr>
<tr>
<th>Reply-to:</th>
<td>Tom Honermann <[email protected]></td>
</tr>
</table>
<h1>char8_t: A type for UTF-8 characters and strings</h1>
<ul>
<li><a href="#introduction">
Introduction</a></li>
<li><a href="#motivation">
Motivation</a></li>
<li><a href="#design">
Design Considerations</a>
<ul>
<li><a href="#design_compat">
Backward compatibility
</a>
<ul>
<li><a href="#design_compat_core">
Core language backward compatibility features
</a>
<ul>
<li><a href="#design_compat_core_implicit_conversion">
Implicit conversions from UTF-8 strings to ordinary strings
</a></li>
</ul>
</li>
<li><a href="#design_compat_library">
Library backward compatibility features
</a>
<ul>
<li><a href="#design_compat_library_convert_u8string_to_string">
Implicit conversion from std::u8string to std::string
</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#design_type_deduction">
Deduced types for UTF-8 literals
</a></li>
<li><a href="#design_narrow_utf8">
Should UTF-8 literals continue to be referred to as narrow literals?
</a></li>
<li><a href="#design_char8_t_underlying_type">
What should be the underlying type of char8_t?
</a></li>
<li><a href="#design_deprecated">
Deprecated features
</a>
<ul>
<li><a href="#design_deprecated_codecvt">
<tt>codecvt</tt> and <tt>codecvt_byname</tt> specializations
</a></li>
<li><a href="#design_deprecated_u8path">
<tt>u8path</tt> path factory functions
</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#implementation_exp">
Implementation Experience</a></li>
<li><a href="#wording">
Formal Wording</a>
<ul>
<li><a href="#core_wording">
Core Wording</a></li>
<li><a href="#library_wording">
Library Wording</a></li>
</ul>
</li>
<li><a href="#acknowledgements">
Acknowledgements</a></li>
<li><a href="#references">
References</a></li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>C++11 introduced support for UTF-8, UTF-16, and UTF-32 encoded string
literals via
<a title="N2249: New Character Types in C++"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html">
N2249
</a>
<sup><a title="N2249: New Character Types in C++"
href="#ref_n2249">
[N2249]</a></sup>.
New <tt>char16_t</tt> and <tt>char32_t</tt> types were added to hold values of
code units for the UTF-16 and UTF-32 variants, but a new type was not added for
the UTF-8 variants. Instead, UTF-8 character literals (added in C++17 via
<a title="N4197: Adding u8 character literals"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4197.html">
N4197
</a>
<sup><a title="N4197: Adding u8 character literals"
href="#ref_n4197">
[N4197]</a></sup>)
and string literals were defined in terms of the <tt>char</tt> type used for
the code unit type of ordinary character and string literals. UTF-8 is the
only text encoding mandated to be supported by the C++ standard for which there
is no distinct code unit type. Lack of a distinct type for UTF-8 encoded
character and string literals prevents the use of overloading and template
specialization in interfaces designed for interoperability with encoded text.
The inability to infer an encoding for narrow characters and strings limits
design possibilities and hinders the production of elegant interfaces that work
seamlessly in generic code. Library authors must choose to limit encoding
support, design interfaces that require users to explicitly specify encodings,
or provide distinct interfaces for, at least, the implementation defined
execution and UTF-8 encodings.</p>
<p>Whether <tt>char</tt> is a signed or unsigned type is implementation defined
and implementations that use an 8-bit signed char are at a disadvantage with
respect to working with UTF-8 encoded text due to the necessity of having to
rely on conversions to unsigned types in order to correctly process leading and
continuation code units of multi-byte encoded code points.</p>
<p>The lack of a distinct type and the use of a code unit type with a range that
does not portably include the full unsigned range of UTF-8 code units presents
challenges for working with UTF-8 encoded text that are not present when working
with UTF-16 or UTF-32 encoded text. Enclosed is a proposal for a new
<tt>char8_t</tt> fundamental type and related library enhancements intended to
remove barriers to working with UTF-8 encoded text and to enable generic
interfaces that work with all five of the standard mandated text encodings in a
consistent manner.</p>
<p>This proposal is incomplete as the author ran out of time preparing it for
the Issaquah mailing deadline. The following are known deficiencies that are
expected to be addressed in a future revision of this proposal.
<ul>
<li>Backward compatibility is not adequately addressed. There is some
discussion in the design considerations section, but no provisions
addressing backward compatibility are currently present in the
wording. The proposed changes effectively bring the standard to the
state the author feels it would likely be in had <tt>char8_t</tt> been
added at the same time as <tt>char16_t</tt> and <tt>char32_t</tt>
were.</li>
<li>An implementation of the proposed changes is not yet available for
assessing the impact to backward compatibility.</li>
<li>The claim that a new type may allow compilers to better optimize code that
works with UTF-8 strings is unsubstantiated.</li>
<li>Wording updates for Annexes C and D have not yet been provided.</li>
<li>Impact to other proposals such as
<a title="P0353R0: Unicode Encoding Conversions for the Standard Library"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0353r0.html">
P0353R0</a>
<sup><a title="P0353R0: Unicode Encoding Conversions for the Standard Library"
href="#ref_p0353r0">
[P0353R0]</a></sup> is not discussed.
</li>
</ul>
<h1 id="motivation">Motivation</h1>
<p>Consider the following string literal expressions, all of which encode
<tt>U+0123</tt>, <tt>LATIN SMALL LETTER G WITH CEDILLA</tt>:
<blockquote class="code">
<tt><pre>
u8"\u0123" // UTF-8: const char[]: 0xC4 0xA3 0x00
u"\u0123" // UTF-16: const char16_t[]: 0x0123 0x0000
U"\u0123" // UTF-32: const char32_t[]: 0x00000123 0x00000000
"\u0123" // ???: const char[]: ???
L"\u0123" // ???: const wchar_t[]: ???
</pre></tt>
</blockquote>
The UTF-8, UTF-16, and UTF-32 string literals have well-defined and portable
sequences of code unit values. The ordinary and wide string literal code unit
sequences depend on the implementation defined execution and execution wide
encodings respectively. Code that is designed to work with text encodings must
be able to differentiate these strings. This is straightforward for wide,
UTF-16, and UTF-32 string literals since they each have a distinct code unit
type suitable for differentiation via function overloading or template
specialization. But for ordinary and UTF-8 string literals, differentiating
between them requires additional information since they have the same code unit
type. That additional information might be provided implicitly via differently
named functions, or explicitly via additional function or template
arguments. For example:</p>
<blockquote class="code">
<tt><pre>
// Differentiation by function name:
void do_x(const char *);
void do_x_utf8(const char *);
// Differentiation by suffix for user-defined literals:
int operator ""_udl(const char *s, std::size_t);
int operator ""_udl_utf8(const char *s, std::size_t);
// Differentiation by function parameter:
void do_x(const char *, bool is_utf8);
// Differentiation by template parameter:
template<bool IsUTF8>
void do_x(const char *);
</pre></tt>
</blockquote>
<p>The requirement to, in some way, specify the text encoding, other than
through the type of the string, limits the ability to provide elegant encoding
sensitive interfaces. Consider the following invocations of the
<tt>make_text_view</tt> function proposed in
<a title="P0244R1: Text_view: A C++ concepts and range based character encoding
and code point enumeration library"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0244r1.html">
P0244R1</a>
<sup><a title="P0244R1: Text_view: A C++ concepts and range based character
encoding and code point enumeration library"
href="#ref_p0244r1">
[P0244R1]</a></sup>:
<blockquote class="code">
<tt><pre>
make_text_view<execution_character_encoding>("text")
make_text_view<execution_wide_character_encoding>(L"text")
make_text_view<utf8_encoding>(u8"text")
make_text_view<utf16_encoding>(u"text")
make_text_view<utf32_encoding>(U"text")
</pre></tt>
</blockquote>
For each invocation, the encoding of the string literal is known at compile
time, so having to explicitly specify the encoding tag feels redundant. If
UTF-8 strings had a distinct type, then the encoding type could be inferred,
while still allowing an overriding tag to be supplied:
<blockquote class="code">
<tt><pre>
make_text_view("text") // defaults to execution_character_encoding.
make_text_view(L"text") // defaults to execution_wide_character_encoding.
make_text_view(u8"text") // defaults to utf8_encoding.
make_text_view(u"text") // defaults to utf16_encoding.
make_text_view(U"text") // defaults to utf32_encoding.
make_text_view<utf16be_encoding>("\0t\0e\0x\0t\0") // Default overridden.
</pre></tt>
</blockquote>
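<p>The inference described above can be sketched with ordinary function
overloading. The following is a minimal illustration only: the
<tt>encoding</tt> enumeration and the <tt>infer_encoding</tt> function are
hypothetical names that appear neither in P0244R1 nor in this proposal, and the
<tt>char8_t</tt> overload is guarded by the <tt>__cpp_char8_t</tt> feature-test
macro on the assumption that compilers providing the type define it.</p>

```cpp
// Hypothetical encoding tags; the names are illustrative only.
enum class encoding { execution, execution_wide, utf8, utf16, utf32 };

// One overload per code unit type. Each distinct type selects one encoding.
constexpr encoding infer_encoding(const char*)     { return encoding::execution; }
constexpr encoding infer_encoding(const wchar_t*)  { return encoding::execution_wide; }
constexpr encoding infer_encoding(const char16_t*) { return encoding::utf16; }
constexpr encoding infer_encoding(const char32_t*) { return encoding::utf32; }

#if defined(__cpp_char8_t)
// Only compiled where a distinct char8_t type is available. Without it,
// u8"text" decays to const char* and selects the execution overload.
constexpr encoding infer_encoding(const char8_t*)  { return encoding::utf8; }
#endif
```

<p>Without the guarded overload, a call such as <tt>infer_encoding(u8"text")</tt>
cannot be told apart from <tt>infer_encoding("text")</tt>, which is the
limitation this section describes.</p>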
<p>The inability to infer an encoding for narrow strings doesn't just limit the
interfaces of new features under consideration. Compromised interfaces are
already present in the standard library.</p>
<p>Consider the design of the <tt>codecvt</tt> class template. The standard
specifies the following specializations of <tt>codecvt</tt> be provided to
enable transcoding text from one encoding to another.
<blockquote class="code">
<tt><pre>
codecvt<char, char, mbstate_t> <em>// #1</em>
codecvt<wchar_t, char, mbstate_t> <em>// #2</em>
codecvt<char16_t, char, mbstate_t> <em>// #3</em>
codecvt<char32_t, char, mbstate_t> <em>// #4</em>
</pre></tt>
</blockquote>
#1 performs no conversions. #2 converts between strings encoded in the
implementation defined wide and narrow encodings. #3 and #4 convert between
either the UTF-16 or UTF-32 encoding and the UTF-8 encoding. Specializations
are not currently specified for conversion between the implementation defined
narrow and wide encodings and any of the UTF-8, UTF-16, or UTF-32 encodings.
However, if support for such conversions were to be added, the desired
interfaces are already taken by #1, #3 and #4.</p>
<p>The file system interface adopted for C++17 via
<a title="P0218R1: Adopt the File System TS for C++17"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0218r1.html">
P0218R1</a>
<sup><a title="P0218R1: Adopt the File System TS for C++17"
href="#ref_p0218r1">
[P0218R1]</a></sup>
provides an example of a feature that supports all five of the standard mandated
encodings, but does so with an asymmetric interface due to the inability to
overload functions for UTF-8 encoded strings. Class
<tt>std::filesystem::path</tt> provides the following constructors to initialize
a <tt>path</tt> object based on a range of code unit values where the encoding
is inferred based on the value type of the range.
<blockquote class="code">
<tt><pre>
template <class Source>
path(const Source& source);
template <class InputIterator>
path(InputIterator first, InputIterator last);
</pre></tt>
</blockquote>
<p>§ 27.10.8.2.2 [path.type.cvt] describes how the source encoding is determined
based on whether the source range value type is <tt>char</tt>, <tt>wchar_t</tt>,
<tt>char16_t</tt>, or <tt>char32_t</tt>. A range with value type <tt>char</tt>
is interpreted using the implementation defined narrow execution encoding. It
is not possible to construct a path object from UTF-8 encoded text using these
constructors.
<p>To accommodate UTF-8 encoded text, the file system library specifies the
following factory functions. Matching factory functions are not provided for
other encodings.
<blockquote class="code">
<tt><pre>
template <class Source>
path u8path(const Source& source);
template <class InputIterator>
path u8path(InputIterator first, InputIterator last);
</pre></tt>
</blockquote>
<p>The requirement to construct <tt>path</tt> objects using one interface for
UTF-8 strings vs another interface for all other supported encodings creates
unnecessary difficulties for portable code. Consider an application that uses
UTF-8 as its internal encoding on POSIX systems, but uses UTF-16 on Windows.
Conditional compilation or other abstractions must be implemented and used
in otherwise platform neutral code to construct <tt>path</tt> objects.</p>
<p>The inability to infer an encoding based on string type is not the only
challenge posed by use of <tt>char</tt> as the UTF-8 code unit type. The
following code exhibits implementation defined behavior.
<blockquote class="code">
<tt><pre>
bool is_utf8_multibyte_code_unit(char c) {
return c >= 0x80;
}
</pre></tt>
</blockquote>
</p>
<p>UTF-8 leading and continuation code units have values in the range 128
(0x80) to 255 (0xFF). In the common case where <tt>char</tt> is implemented
as a signed 8-bit type with a two's complement representation and a range of
-128 (-0x80) to 127 (0x7F), these values exceed the maximum value of the
<tt>char</tt> type. Such implementations typically store these code units as
unsigned bit patterns, which are then reinterpreted as negative values when
read. In the code above, integral promotion rules result in <tt>c</tt> being
promoted to type <tt>int</tt> for comparison to the <tt>0x80</tt> operand. If
<tt>c</tt> holds a value corresponding to a leading or continuation code unit,
then its value will be interpreted as negative and the promoted value of type
<tt>int</tt> will likewise be negative. The result is that the comparison
is always false for these implementations.</p>
<p>To correct the code above, explicit conversions are required. For example:
<blockquote class="code">
<tt><pre>
bool is_utf8_multibyte_code_unit(char c) {
return static_cast<unsigned char>(c) >= 0x80;
}
</pre></tt>
</blockquote>
</p>
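<p>As a worked illustration, the sketch below encodes <tt>U+0123</tt> into its
UTF-8 code units and applies both forms of the check to the leading code unit.
It is an assumption-laden example rather than part of this proposal:
<tt>utf8_sequence</tt>, <tt>encode_utf8</tt>, and the two predicate names are
hypothetical, and the predicates take <tt>signed char</tt> explicitly so that
the behavior described above is reproduced regardless of how the implementation
defines plain <tt>char</tt>.</p>

```cpp
// Hypothetical result type: up to four code units; length 0 means invalid.
struct utf8_sequence {
    unsigned char units[4];
    int length;
};

constexpr unsigned char unit(char32_t v) { return static_cast<unsigned char>(v); }

// Encodes one Unicode scalar value as UTF-8 code units.
constexpr utf8_sequence encode_utf8(char32_t cp) {
    utf8_sequence s{{0, 0, 0, 0}, 0};
    if (cp < 0x80) {
        s.units[0] = unit(cp);
        s.length = 1;
    } else if (cp < 0x800) {
        s.units[0] = unit(0xC0 | (cp >> 6));
        s.units[1] = unit(0x80 | (cp & 0x3F));
        s.length = 2;
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return s;  // surrogates are not encodable
        s.units[0] = unit(0xE0 | (cp >> 12));
        s.units[1] = unit(0x80 | ((cp >> 6) & 0x3F));
        s.units[2] = unit(0x80 | (cp & 0x3F));
        s.length = 3;
    } else if (cp <= 0x10FFFF) {
        s.units[0] = unit(0xF0 | (cp >> 18));
        s.units[1] = unit(0x80 | ((cp >> 12) & 0x3F));
        s.units[2] = unit(0x80 | ((cp >> 6) & 0x3F));
        s.units[3] = unit(0x80 | (cp & 0x3F));
        s.length = 4;
    }
    return s;
}

// The uncorrected check, with the integral promotion written out.
constexpr bool naive_is_multibyte(signed char c) {
    int promoted = c;         // a negative c stays negative after promotion
    return promoted >= 0x80;  // never true: promoted lies in [-128, 127]
}

// The corrected check: convert to unsigned char before comparing.
constexpr bool fixed_is_multibyte(signed char c) {
    return static_cast<unsigned char>(c) >= 0x80;
}
```

<p>For <tt>U+0123</tt> the encoder yields the code units <tt>0xC4 0xA3</tt>
listed earlier. Stored in a signed 8-bit type, <tt>0xC4</tt> reads back as a
negative value, so only the corrected predicate classifies it as part of a
multi-byte sequence.</p>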
<p>Finally, processing of UTF-8 strings is currently subject to a
pessimization because glvalue expressions of type <tt>char</tt> may alias
objects of other types. Use of a distinct type that does not share this
aliasing behavior may allow for further compiler optimizations.</p>
<h1 id="design">Design Considerations</h1>
<h2 id="design_compat">Backward compatibility</h2>
<p>This proposal does not specify any backward compatibility features other than
to retain interfaces that it deprecates. The lack of such features is not due
to a belief that backward compatibility features are not necessary. The author
believes such features are necessary, but time constraints prevented adequately
researching what issues must be addressed, to what degree they must be
addressed, and how those features should be specified. The author intends to
address these concerns in a future revision of this document. In the meantime,
the following sections discuss some of the backward compatibility impact and
possible solution directions.</p>
<h3 id="design_compat_core">Core language backward compatibility features</h3>
<h4 id="design_compat_core_implicit_conversion">
Implicit conversions from UTF-8 strings to ordinary strings</h4>
<p>It may be necessary to allow implicit conversions for UTF-8 string literals
from <tt>const char8_t[]</tt> to <tt>const char[]</tt> to allow currently
well-formed code like the following to remain well-formed:
<blockquote class="code">
<tt><pre>
template<typename T> void f(const T*);
void f(const char*);
f(u8"text"); // Ok, calls f(const char*).
...
char u8a[] = u8"text"; // Ok.
const char (&u8r)[] = u8"text"; // Ok.
const char *u8s = u8"text"; // Ok.
</pre></tt>
</blockquote>
</p>
<p>It may also be necessary to permit implicit conversions for non-literal UTF-8
strings:
<blockquote class="code">
<tt><pre>
const auto *u8s = u8"text"; // C++14: Ok, type deduced to <tt>const char*</tt>.
// This proposal: Ok, type deduced to <tt>const char8_t*</tt>.
const char *s = u8s; // C++14: Ok, <tt>u8s</tt> has type <tt>const char*</tt>.
// This proposal: An implicit conversion from <tt>const char8_t*</tt>
// to <tt>const char*</tt> would be required for this assignment
// to remain well-formed.
</pre></tt>
</blockquote>
</p>
<p>If such implicit conversions are found to be necessary, specifying them may
present a small challenge. The standard conversion sequence might have to be
modified to allow a data representation conversion prior to an lvalue
transformation in order for an argument of, for example, array of
<tt>char8_t</tt> to match a parameter of type <tt>char*</tt>. However, the
standard conversion sequence, as described in § 13.3.3.1.1 [over.ics.scs],
states that lvalue transformations, including the array-to-pointer conversion,
are performed before promotions and conversions that might change the data
representation. It may be feasible to avoid such a change by stating that a
candidate function that involves such an implicit conversion is only a viable
function if no other viable non-template functions are identified, but the
author has not yet convinced himself of this possibility.</p>
<p>If such implicit conversions are found to be necessary, providing them as
deprecated features would enable a transition period and eventual removal.</p>
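<p>The overload resolution change that motivates these conversions can be
observed today with <tt>char16_t</tt>, which already behaves the way
<tt>char8_t</tt> would. The sketch below is illustrative only: the integer
return values are an assumption made so that the selected overload can be
checked at compile time, and the <tt>__cpp_char8_t</tt> macro is assumed to be
defined by compilers that provide the type.</p>

```cpp
// Generic overload: selected for any code unit type without an exact match.
template <typename T>
constexpr int f(const T*) { return 1; }

// Ordinary-string overload: preferred over the template for const char*.
constexpr int f(const char*) { return 2; }

// "text" decays to const char*, so the non-template overload wins.
static_assert(f("text") == 2, "ordinary literal selects f(const char*)");

// u"text" decays to const char16_t*, so only the template is viable.
static_assert(f(u"text") == 1, "UTF-16 literal selects the template");

#if defined(__cpp_char8_t)
// With a distinct char8_t, u8"text" follows the char16_t pattern.
static_assert(f(u8"text") == 1, "UTF-8 literal selects the template");
#else
// Without it, u8"text" is indistinguishable from an ordinary literal.
static_assert(f(u8"text") == 2, "UTF-8 literal selects f(const char*)");
#endif
```

<p>Code that relies on the second outcome is the code that would need one of
the implicit conversions discussed in this section to keep its current
behavior.</p>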
<h3 id="design_compat_library">Library backward compatibility features</h3>
<h4 id="design_compat_library_convert_u8string_to_string">
Implicit conversion from std::u8string to std::string</h4>
<p>This proposal includes a new specialization of <tt>std::basic_string</tt>
for the new <tt>char8_t</tt> type, the associated typedef
<tt>std::u8string</tt>, and changes to several functions to now return
<tt>std::u8string</tt> instead of <tt>std::string</tt>. This change renders
ill-formed the following code that is currently well-formed.
<blockquote class="code">
<tt><pre>
void f(std::filesystem::path p) {
std::string s = p.u8string(); // C++14: Ok.
// This proposal: ill-formed unless conversions
// from <tt>std::u8string</tt> to <tt>std::string</tt>
// are provided.
}
</pre></tt>
</blockquote>
</p>
<p>Implicit conversions from <tt>std::u8string</tt> to <tt>std::string</tt>
would be undesirable in general. If they are found to be necessary, providing
them as a deprecated feature seems warranted.</p>
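<p>If no implicit conversion is provided, affected code would have to copy the
code units explicitly. The following sketch shows one way that might look. The
<tt>to_ordinary_string</tt> helper is hypothetical and is not proposed here; it
is written against an iterator range so that it does not depend on
<tt>char8_t</tt> being available, and it copies byte values without any
transcoding.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of this proposal: copies single-byte code
// units from an iterator range into a std::string without transcoding.
template <class InputIterator>
std::string to_ordinary_string(InputIterator first, InputIterator last) {
    static_assert(sizeof(*first) == 1, "only single-byte code units are supported");
    std::string result;
    for (; first != last; ++first) {
        result.push_back(static_cast<char>(*first));
    }
    return result;
}

// Usage check: the two UTF-8 code units of U+0123 survive the copy unchanged.
inline void check_to_ordinary_string() {
    const unsigned char units[] = {0xC4, 0xA3};
    const std::string s = to_ordinary_string(units, units + 2);
    assert(s.size() == 2);
    assert(static_cast<unsigned char>(s[0]) == 0xC4);
    assert(static_cast<unsigned char>(s[1]) == 0xA3);
}
```

<p>The resulting <tt>std::string</tt> still holds UTF-8 code units, so the
encoding information carried by the source type is lost at the point of the
copy. Making that loss visible in the source is the argument for keeping any
such conversion explicit.</p>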
<h2 id="design_type_deduction">Deduced types for UTF-8 literals</h2>
<p>Under this proposal, UTF-8 string and character literals have type
<tt>const char8_t[]</tt> and <tt>char8_t</tt> respectively. This affects the
types deduced for placeholder types and template parameter types.
<blockquote class="code">
<tt><pre>
template<typename T1, typename T2>
void ft(T1, T2);
...
ft(u8"text", u8'c'); // C++14: T1 deduced to const char*, T2 deduced to char.
// This proposal: T1 deduced to const char8_t*, T2 deduced to char8_t.
...
auto u8s = u8"text"; // C++14: Type deduced to const char*.
// This proposal: Type deduced to const char8_t*.
auto u8c = u8'c'; // C++14: Type deduced to char.
// This proposal: Type deduced to char8_t.
</pre></tt>
</blockquote>
</p>
<p>This has the potential to affect backward compatibility in code that depends
on overload resolution selecting the same overload for calls involving both
ordinary and UTF-8 strings. For example:
<blockquote class="code">
<tt><pre>
template<typename T>
int ft(T) {
static int count = 0;
return count++;
}
...
ft("text"); // Returns 0.
ft(u8"text"); // C++14: Returns 1.
// This proposal: Returns 0.
</pre></tt>
</blockquote>
</p>
<h2 id="design_narrow_utf8">
Should UTF-8 literals continue to be referred to as narrow literals?</h2>
<p>UTF-8 literals are maintained as narrow literals in this proposal.</p>
<h2 id="design_char8_t_underlying_type">
What should be the underlying type of char8_t?</h2>
<p>There are several choices for the underlying type of <tt>char8_t</tt>.
Use of <tt>unsigned char</tt> closely aligns with historical use. Use of
<tt>uint_least8_t</tt> would maintain consistency with how the underlying
types of <tt>char16_t</tt> and <tt>char32_t</tt> are specified.</p>
<p>This proposal specifies <tt>unsigned char</tt> as the underlying type as
noted in the changes to § 3.9.1 <tt>[basic.fundamental]</tt> paragraph 5.</p>
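<p>The relationship between a character type and its underlying type can be
checked with type traits. The sketch below is an illustration under stated
assumptions: <tt>mirrors_underlying</tt> is a hypothetical trait, not proposed
wording, and the <tt>char8_t</tt> check is compiled only where the
<tt>__cpp_char8_t</tt> feature-test macro indicates that the type exists.</p>

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical trait: true when CharT is a distinct type that shares the
// size, alignment, and signedness of Underlying.
template <class CharT, class Underlying>
struct mirrors_underlying
    : std::integral_constant<bool,
          !std::is_same<CharT, Underlying>::value &&
          sizeof(CharT) == sizeof(Underlying) &&
          alignof(CharT) == alignof(Underlying) &&
          std::is_signed<CharT>::value == std::is_signed<Underlying>::value> {};

// The existing Unicode character types follow this pattern.
static_assert(mirrors_underlying<char16_t, std::uint_least16_t>::value,
              "char16_t mirrors uint_least16_t");
static_assert(mirrors_underlying<char32_t, std::uint_least32_t>::value,
              "char32_t mirrors uint_least32_t");

#if defined(__cpp_char8_t)
// The relationship this proposal specifies for char8_t.
static_assert(mirrors_underlying<char8_t, unsigned char>::value,
              "char8_t mirrors unsigned char");
#endif
```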
<h2 id="design_deprecated">Deprecated features</h2>
<h3 id="design_deprecated_codecvt">
<tt>codecvt</tt> and <tt>codecvt_byname</tt> specializations
</h3>
<p>This proposal introduces new <tt>codecvt</tt> and <tt>codecvt_byname</tt>
specializations that use <tt>char8_t</tt> for conversion to and from UTF-8
and deprecates the existing ones specified in terms of <tt>char</tt>.
The new specializations are functionally identical to the deprecated ones.</p>
<h3 id="design_deprecated_u8path"><tt>u8path</tt> path factory functions</h3>
<p>Filesystem <tt>path</tt> objects may now be constructed with UTF-8 strings using
the existing <tt>path</tt> constructors used for construction with other
encodings as specified in § 27.10.8.2.2 [path.type.cvt] and § 27.10.8.4.1
[path.construct]. This proposal deprecates the existing <tt>u8path</tt> path
factory functions specified in § 27.10.8.6.2 [path.factory].</p>
<h1 id="implementation_exp">Implementation Experience</h1>
<p>None yet, but the author intends to prototype an implementation in
gcc/libstdc++ and/or Clang/libc++.</p>
<h1 id="wording">Formal Wording</h1>
<input type="checkbox" id="hidedel"> Hide deleted text
<p>These changes are relative to
<a title="Working Draft, Standard for Programming Language C++"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf">
N4606</a>
<sup><a title="Working Draft, Standard for Programming Language C++"
href="#ref_n4606">
[N4606]</a></sup></p>
<h2 id="core_wording">Core Wording</h2>
<p>Add <tt>char8_t</tt> to the list of keywords in table 3 in 2.11 [lex.key]
paragraph 1. </p>
<p>Change in 2.13.3 [lex.ccon] paragraph 3:
<blockquote>
A character literal that begins with <tt>u8</tt>, such as <tt>u8'w'</tt>, is a
character literal of type <del><tt>char</tt></del><ins><tt>char8_t</tt></ins>,
known as a <em>UTF-8 character literal</em>.[…]
</blockquote>
</p>
<p>Remove 2.13.5 [lex.string] paragraph 7:
<blockquote class=stddel>
A <em>string-literal</em> that begins with <tt>u8</tt>, such as
<tt>u8"asdf"</tt>, is a UTF-8 string literal.
</blockquote>
</p>
<p>Change in 2.13.5 [lex.string] paragraph 8:
<blockquote>
Ordinary string literals and UTF-8 string literals are also referred to as
narrow string literals. <del>A narrow string literal has type “array of n
const char”, where n is the size of the string as defined below, and has
static storage duration (3.7).</del>
</blockquote>
</p>
<p>Add a new paragraph after 2.13.5 [lex.string] paragraph 8:
<blockquote class=stdins>
An ordinary string literal has type "array of n const char", where n is the size
of the string as defined below, and has static storage duration (3.7).
</blockquote>
</p>
<p>Change in 2.13.5 [lex.string] paragraph 9:
<blockquote>
<del>For a UTF-8 string literal, each successive element of the object
representation (3.9) has the value of the corresponding code unit of the UTF-8
encoding of the string.</del>
<ins>A <em>string-literal</em> that begins with <tt>u8</tt>, such as
<tt>u8"asdf"</tt>, is a UTF-8 string literal, also referred to as a
<tt>char8_t</tt> string literal. A <tt>char8_t</tt> string literal has type
"array of n <tt>const char8_t</tt>", where n is the size of the string as
defined below; each successive element of the object representation (3.9) has
the value of the corresponding code unit of the UTF-8 encoding of the
<em>s-char-sequence</em>. A single <em>s-char</em> may produce more than one
<tt>char8_t</tt> code unit.</ins>
</blockquote>
</p>
<p>Change in 2.13.5 [lex.string] paragraph 15:
<blockquote>
[…] In a narrow string literal, a <em>universal-character-name</em>
may map to more than one <tt>char</tt> <ins>or <tt>char8_t</tt></ins> element
due to multibyte encoding. […]
</blockquote>
</p>
<p>Change in 3.9.1 [basic.fundamental] paragraph 1:
<blockquote>
Objects declared <del>as characters</del><ins>with type </ins>
<del>(</del><tt>char</tt><del>)</del> shall be large enough to store any member
of the implementation’s basic character set. If a character from this set is
stored in a character object, the integral value of that character object is
equal to the value of the single character literal form of that character. It is
implementation-defined whether a char object can hold negative values.
Characters <ins>declared with type <tt>char</tt> </ins>can be explicitly
declared <tt>unsigned</tt> or <tt>signed</tt>. Plain <tt>char</tt>,
<tt>signed char</tt>, and <tt>unsigned char</tt> are three distinct types,
collectively called <em><del>narrow</del><ins>ordinary</ins> character
types</em>. <ins>The <em>ordinary character types</em> and <tt>char8_t</tt> are
collectively called <em>narrow character types</em>.</ins> A <tt>char</tt>, a
<tt>signed char</tt>, <del>and </del>an <tt>unsigned char</tt><ins>, and a
<tt>char8_t</tt></ins> occupy the same amount of storage and have the same
alignment requirements (3.11); that is, they have the same object
representation. For narrow character types, all bits of the object
representation participate in the value representation. [ <em>Note</em>: A
bit-field of narrow character type whose length is larger than the number of
bits in the object representation of that type has padding bits; see 9.2.4.
— <em>end note</em> ] For unsigned narrow character types<ins>, including
<tt>char8_t</tt></ins>, each possible bit pattern of the value representation
represents a distinct number. These requirements do not hold for other types.
In any particular implementation, a plain <tt>char</tt> object <del>can</del>
<ins>shall</ins> take on either the same values as a <tt>signed char</tt> or an
<tt>unsigned char</tt>; which one is implementation-defined. For each value
<em>i</em> of type <tt>unsigned char</tt><ins>, or <tt>char8_t</tt></ins> in the
range 0 to 255 inclusive, there exists a value <em>j</em> of type <tt>char</tt>
such that the result of an integral conversion (4.8) from <em>i</em> to
<tt>char</tt> is <em>j</em>, and the result of an integral conversion from
<em>j</em> to <tt>unsigned char</tt><ins> or <tt>char8_t</tt></ins> is
<em>i</em>.
</blockquote>
</p>
<p>Change in 3.9.1 [basic.fundamental] paragraph 5:
<blockquote>
[…] Type <tt>wchar_t</tt> shall have the same size, signedness, and
alignment requirements (3.11) as one of the other integral types, called its
underlying type. <ins>Type <tt>char8_t</tt> denotes a distinct type with the
same size, signedness, and alignment as <tt>unsigned char</tt>, called its
underlying type.</ins> Types <tt>char16_t</tt> and <tt>char32_t</tt> denote
distinct types with the same size, signedness, and alignment as
<tt>uint_least16_t</tt> and <tt>uint_least32_t</tt>, respectively, in
<tt><cstdint></tt>, called the underlying types.
</blockquote>
</p>
<p>Change in 3.9.1 [basic.fundamental] paragraph 7:
<blockquote>
Types <tt>bool</tt>, <tt>char</tt>, <ins><tt>char8_t</tt>, </ins>
<tt>char16_t</tt>, <tt>char32_t</tt>, <tt>wchar_t</tt>, and the signed and
unsigned integer types are collectively called integral types.
</blockquote>
</p>
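<p><em>Illustrative example (not part of the proposed wording): a sketch of
what "distinct type" means in practice, assuming a compiler that implements
<tt>char8_t</tt> as later adopted in C++20. The overload set <tt>kind</tt> and
the function <tt>check_overload_selection</tt> are hypothetical names; the
<tt>__cpp_char8_t</tt> guard is an assumption borrowed from C++20.</em></p>

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

#if defined(__cpp_char8_t)
// char8_t shares size, signedness, and alignment with unsigned char,
// yet it is a separate integral type.
static_assert(!std::is_same<char8_t, unsigned char>::value, "distinct from underlying type");
static_assert(!std::is_same<char8_t, char>::value, "distinct from char");
static_assert(std::is_integral<char8_t>::value, "an integral type");
static_assert(std::is_unsigned<char8_t>::value, "same signedness as unsigned char");

// Hypothetical overload set: distinct types can be overloaded on.
inline const char* kind(unsigned char) { return "unsigned char"; }
inline const char* kind(char8_t)       { return "char8_t"; }

inline void check_overload_selection() {
    assert(std::strcmp(kind(u8'a'), "char8_t") == 0);
    assert(std::strcmp(kind(static_cast<unsigned char>('a')), "unsigned char") == 0);
}
#else
// Without char8_t there is nothing to check.
inline void check_overload_selection() {}
#endif
```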
<p>Change in 4.15 [conv.rank] paragraph 1:
<blockquote>
[…]<br/>
(1.8) — The ranks of <ins><tt>char8_t</tt>, </ins><tt>char16_t</tt>,
<tt>char32_t</tt>, and <tt>wchar_t</tt> shall equal the ranks of their
underlying types (3.9.1).
<br/>[…]
</blockquote>
</p>
<p>Change to footnote 62 associated with 5 [expr] paragraph 11 (11.5):
<blockquote>
As a consequence, operands of type <tt>bool</tt>, <ins><tt>char8_t</tt>, </ins>
<tt>char16_t</tt>, <tt>char32_t</tt>, <tt>wchar_t</tt>, or an enumerated type
are converted to some integral type.
</blockquote>
</p>
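<p><em>Illustrative example (not part of the proposed wording): a sketch of
the arithmetic consequences of giving <tt>char8_t</tt> the rank of its
underlying type. It assumes an implementation where <tt>int</tt> can represent
every <tt>unsigned char</tt> value, so operands promote to <tt>int</tt>. The
alias <tt>code_unit</tt> and the helper names are hypothetical, and the
<tt>__cpp_char8_t</tt> guard is borrowed from C++20.</em></p>

```cpp
#include <cassert>
#include <type_traits>

#if defined(__cpp_char8_t)
using code_unit = char8_t;        // the type proposed in this paper
#else
using code_unit = unsigned char;  // stand-in with the same rank
#endif

// Assumption: int can hold all values of unsigned char, so both
// unary plus and binary plus yield int after integral promotion.
static_assert(std::is_same<decltype(+code_unit{}), int>::value, "promotes to int");
static_assert(std::is_same<decltype(code_unit{} + code_unit{}), int>::value, "sum is int");

// Hypothetical helper: the addition is carried out in int.
inline int sum_code_units(code_unit a, code_unit b) { return a + b; }

inline void check_no_wraparound() {
    // 200 + 100 does not wrap to 44 because the operands were promoted.
    assert(sum_code_units(static_cast<code_unit>(200), static_cast<code_unit>(100)) == 300);
}
```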
<p>Change in 5.3.3 [expr.sizeof] paragraph 1:
<blockquote>
[…] <tt>sizeof(char)</tt>, <tt>sizeof(signed char)</tt><ins>,</ins>
<del>and</del> <tt>sizeof(unsigned char)</tt><ins>, and <tt>sizeof(char8_t)</tt></ins>
are 1. […]
</blockquote>
</p>
<p>Change in 7.1.7.2 [dcl.type.simple] paragraph 1:
<blockquote>
The simple type specifiers are<br/>
<div style="margin-left: 1em;">
<em>simple-type-specifier</em>:<br/>
<div style="margin-left: 1em;">
[…]<br/>
<tt>char</tt><br/>
<ins><tt>char8_t</tt></ins><br/>
<tt>char16_t</tt><br/>
<tt>char32_t</tt><br/>
[…]<br/>
</div>
</div>
</blockquote>
</p>
<p>Change in table 9 of 7.1.7.2 [dcl.type.simple] paragraph 4:
<blockquote>
[…]<br/>
(4.5) — otherwise, <tt>decltype(e)</tt> is the type of <tt>e</tt>.
<div style="margin-left: 1em;">
<table>
<tr>
<td align="center">
Table 9 — <em>simple-type-specifiers</em> and the types they specify
</td>
</tr>
<tr>
<td align="center">
<table border="1">
<tr>
<th>Specifier(s)</th>
<th>Type</th>
</tr>
<tr>
<td>[…]</td>
<td>[…]</td>
</tr>
<tr>
<td><tt>char</tt></td>
<td><tt>“char”</tt></td>
</tr>
<tr>
<td><tt>unsigned char</tt></td>
<td><tt>“unsigned char”</tt></td>
</tr>
<tr>
<td><tt>signed char</tt></td>
<td><tt>“signed char”</tt></td>
</tr>
<tr>
<td><ins><tt>char8_t</tt></ins></td>
<td><ins><tt>“char8_t”</tt></ins></td>
</tr>
<tr>
<td><tt>char16_t</tt></td>
<td><tt>“char16_t”</tt></td>
</tr>
<tr>
<td><tt>char32_t</tt></td>
<td><tt>“char32_t”</tt></td>
</tr>
<tr>
<td>[…]</td>
<td>[…]</td>
</tr>
</table>
</td>
</tr>
</table>
</div>
<br/>[…]
</blockquote>
</p>
<p>Change in 8.6 [dcl.init] paragraph 17:
<blockquote>
[…]<br/>
(17.3) — If the destination type is an array of characters, <ins>an
array of <tt>char8_t</tt>, </ins>an array of <tt>char16_t</tt>, an array of
<tt>char32_t</tt>, or an array of <tt>wchar_t</tt>, and the initializer is a
string literal, see 8.6.2.
<br/>[…]
</blockquote>
</p>
<p>Change in 8.6.2 [dcl.init.string] paragraph 1:
<blockquote>
An array of <del>narrow</del><ins>ordinary</ins> character type (3.9.1),
<ins><tt>char8_t</tt> array, </ins><tt>char16_t</tt> array, <tt>char32_t</tt>
array, or <tt>wchar_t</tt> array can be initialized by a narrow string literal,
<ins>char8_t string literal, </ins>char16_t string literal, char32_t string
literal, or wide string literal, respectively, […]
</blockquote>
</p>
<p><em>Drafting note: It is intentional that an array of ordinary character
type can be initialized by a narrow string literal, including UTF-8 string
literals. This is a backward compatibility feature.</em></p>
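<p><em>Illustrative example (not part of the proposed wording): a sketch of
initializing an array from a UTF-8 string literal. The alias
<tt>utf8_unit</tt> and the names <tt>text</tt>, <tt>text_length</tt>, and
<tt>check_text</tt> are hypothetical. The alias selects <tt>char8_t</tt> where
the <tt>__cpp_char8_t</tt> macro (from C++20, assumed here) is defined, and
<tt>char</tt> otherwise, so the same declaration is accepted either way.</em></p>

```cpp
#include <cassert>
#include <cstddef>

#if defined(__cpp_char8_t)
using utf8_unit = char8_t;  // element type of u8 literals under this proposal
#else
using utf8_unit = char;     // element type of u8 literals before the proposal
#endif

// "h" followed by U+00E9, which UTF-8 encodes as the two code units 0xC3 0xA9.
const utf8_unit text[] = u8"h\u00E9";

// Number of code units, excluding the terminating null.
const std::size_t text_length = sizeof(text) / sizeof(text[0]) - 1;

inline void check_text() {
    assert(text_length == 3);
    assert(static_cast<unsigned char>(text[1]) == 0xC3);
    assert(static_cast<unsigned char>(text[2]) == 0xA9);
    assert(text[3] == 0);
}
```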
<p>Change in 13.5.8 [over.literal] paragraph 3:
<blockquote>
The declaration of a literal operator shall have a
<em>parameter-declaration-clause</em> equivalent to one of the following:
<div style="margin-left: 1em;">
[…]<br/>
<tt>char</tt><br/>
<tt>wchar_t</tt><br/>
<ins><tt>char8_t</tt></ins><br/>
<tt>char16_t</tt><br/>
<tt>char32_t</tt><br/>
<tt>const char*</tt>, <tt>std::size_t</tt><br/>
<tt>const wchar_t*</tt>, <tt>std::size_t</tt><br/>
<ins><tt>const char8_t*</tt>, <tt>std::size_t</tt></ins><br/>
<tt>const char16_t*</tt>, <tt>std::size_t</tt><br/>
<tt>const char32_t*</tt>, <tt>std::size_t</tt><br/>
[…]<br/>
</div>
</blockquote>
</p>
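<p><em>Illustrative example (not part of the proposed wording): a sketch of
literal operators that use the two added parameter forms. The suffixes
<tt>_units</tt> and <tt>_value</tt>, the alias <tt>utf8_unit</tt>, and the
function <tt>check_literals</tt> are hypothetical. The <tt>__cpp_char8_t</tt>
guard is an assumption borrowed from C++20; without it the alias falls back to
<tt>char</tt>.</em></p>

```cpp
#include <cassert>
#include <cstddef>

#if defined(__cpp_char8_t)
using utf8_unit = char8_t;  // parameter type added by this proposal
#else
using utf8_unit = char;     // pre-proposal type of UTF-8 literals
#endif

// String form: receives a pointer to the code units and their count.
constexpr std::size_t operator""_units(const utf8_unit*, std::size_t n) { return n; }

// Character form: receives a single code unit.
constexpr unsigned operator""_value(utf8_unit c) { return static_cast<unsigned char>(c); }

inline void check_literals() {
    assert(u8"abc"_units == 3);
#if defined(__cpp_char8_t)
    assert(u8'a'_value == 0x61);  // u8 character literal has type char8_t here
#endif
}
```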
<h2 id="library_wording">Library Wording</h2>
<p>Change in 17.1 [library.general] paragraph 7:
<blockquote>
The strings library (Clause 21) provides support for manipulating text
represented as sequences of type <tt>char</tt>,
<ins>sequences of type <tt>char8_t</tt>, </ins>
sequences of type <tt>char16_t</tt>,
sequences of type <tt>char32_t</tt>,
sequences of type <tt>wchar_t</tt>,
and sequences of any other character-like type.
</blockquote>
</p>
<p>Change in 17.3.3 [defns.character] paragraph 3:
<blockquote>
[…]<br/>
[ <em>Note:</em> The term does not mean only <tt>char</tt>,
<ins><tt>char8_t</tt>, </ins><tt>char16_t</tt>, <tt>char32_t</tt>, and
<tt>wchar_t</tt> objects, but any value that can be represented by a type
that provides the definitions specified in these Clauses. —
<em>end note</em> ]
</blockquote>
</p>
<p>Change in 18.3.2.2 [limits.syn]:
<blockquote>
<div style="margin-left: 1em;">
<tt>
[…]<br/>
template<> class numeric_limits<char>;<br/>
template<> class numeric_limits<signed char>;<br/>
template<> class numeric_limits<unsigned char>;<br/>
<ins>template<> class numeric_limits<char8_t>;</ins><br/>
template<> class numeric_limits<char16_t>;<br/>
template<> class numeric_limits<char32_t>;<br/>
template<> class numeric_limits<wchar_t>;<br/>
[…]<br/>
</tt>
</div>
</blockquote>
</p>
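<p><em>Illustrative example (not part of the proposed wording): a sketch of
what the added <tt>numeric_limits</tt> specialization would be expected to
report, given that <tt>char8_t</tt> has the same size and signedness as
<tt>unsigned char</tt>. The alias <tt>code_unit</tt> and the function
<tt>check_limits</tt> are hypothetical; the <tt>__cpp_char8_t</tt> guard is
borrowed from C++20.</em></p>

```cpp
#include <cassert>
#include <limits>

#if defined(__cpp_char8_t)
using code_unit = char8_t;        // the type proposed in this paper
#else
using code_unit = unsigned char;  // stand-in for compilers without char8_t
#endif

static_assert(std::numeric_limits<code_unit>::is_specialized, "specialization exists");
static_assert(std::numeric_limits<code_unit>::is_integer, "integral type");
static_assert(!std::numeric_limits<code_unit>::is_signed, "unsigned type");

inline void check_limits() {
    typedef std::numeric_limits<code_unit> lim;
    typedef std::numeric_limits<unsigned char> ulim;
    assert(lim::min() == 0);
    assert(lim::max() == ulim::max());  // same range as the underlying type
    assert(lim::digits == ulim::digits);
}
```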
<p>Change in 20.14 [function.objects] paragraph 2:
<blockquote>
[…]<br/>
// Hash function specializations<br/>
<tt>template <> struct hash<bool>;</tt><br/>
<tt>template <> struct hash<char>;</tt><br/>
<tt>template <> struct hash<signed char>;</tt><br/>
<tt>template <> struct hash<unsigned char>;</tt><br/>
<ins><tt>template <> struct hash<char8_t>;</tt></ins><br/>
<tt>template <> struct hash<char16_t>;</tt><br/>
<tt>template <> struct hash<char32_t>;</tt><br/>
<tt>template <> struct hash<wchar_t>;</tt><br/>
[…]<br/>
</blockquote>
</p>
<p>Change in 20.14.14 [unord.hash] paragraph 1:
<blockquote>
<tt>
[…]<br/>
template <> struct hash<bool>;<br/>
template <> struct hash<char>;<br/>
template <> struct hash<signed char>;<br/>
template <> struct hash<unsigned char>;<br/>
<ins>template <> struct hash<char8_t>;</ins><br/>
template <> struct hash<char16_t>;<br/>
template <> struct hash<char32_t>;<br/>
template <> struct hash<wchar_t>;<br/>
[…]<br/>
</tt>
</blockquote>
</p>
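<p><em>Illustrative example (not part of the proposed wording): with a
<tt>hash</tt> specialization available, UTF-8 code units can be used as keys
in the unordered containers. The alias <tt>utf8_unit</tt> and the functions
<tt>count_distinct_units</tt> and <tt>check_hash</tt> are hypothetical; the
<tt>__cpp_char8_t</tt> guard is an assumption borrowed from C++20.</em></p>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>

#if defined(__cpp_char8_t)
using utf8_unit = char8_t;  // uses the hash specialization added by this proposal
#else
using utf8_unit = char;     // pre-proposal type of UTF-8 literals
#endif

// Counts the distinct code units in a null-terminated sequence.
inline std::size_t count_distinct_units(const utf8_unit* s) {
    std::unordered_set<utf8_unit> seen;  // relies on std::hash for utf8_unit
    for (; *s != 0; ++s) {
        seen.insert(*s);
    }
    return seen.size();
}

inline void check_hash() {
    assert(count_distinct_units(u8"banana") == 3);  // 'b', 'a', 'n'
}
```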
<p>Change in 21.2 [char.traits] paragraph 1:
<blockquote>
This subclause defines requirements on classes representing <em>character
traits</em>, and defines a class template <tt>char_traits<charT></tt>,
along with <del>four</del><ins>five</ins> specializations,
<tt>char_traits<char></tt>,
<ins><tt>char_traits<char8_t></tt>,</ins>
<tt>char_traits<char16_t></tt>,
<tt>char_traits<char32_t></tt>,
and <tt>char_traits<wchar_t></tt>,
that satisfy those requirements.
</blockquote>
</p>
<p>Change in 21.2 [char.traits] paragraph 4:
<blockquote>
This subclause specifies a class template, <tt>char_traits<charT></tt>,
and <del>four</del><ins>five</ins> explicit specializations of it,
<tt>char_traits<char></tt>,
<ins><tt>char_traits<char8_t></tt>,</ins>
<tt>char_traits<char16_t></tt>,
<tt>char_traits<char32_t></tt>, and
<tt>char_traits<wchar_t></tt>, all of which appear in the header
<tt><string></tt> and satisfy the requirements below.
</blockquote>
</p>
<p><em>Drafting note: 21.2p4 appears to unnecessarily duplicate information
previously presented in 21.2p1.</em></p>
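<p><em>Illustrative example (not part of the proposed wording): a sketch of
using the added <tt>char_traits</tt> specialization directly. The aliases
<tt>utf8_unit</tt> and <tt>traits</tt> and the function
<tt>check_traits</tt> are hypothetical; the <tt>__cpp_char8_t</tt> guard is an
assumption borrowed from C++20.</em></p>

```cpp
#include <cassert>
#include <string>

#if defined(__cpp_char8_t)
using utf8_unit = char8_t;  // selects the specialization added by this proposal
#else
using utf8_unit = char;     // pre-proposal type of UTF-8 literals
#endif

using traits = std::char_traits<utf8_unit>;

inline void check_traits() {
    // length counts code units, not characters: U+00E9 takes two units.
    assert(traits::length(u8"h\u00E9") == 3);
    // compare orders sequences lexicographically by code unit.
    assert(traits::compare(u8"abc", u8"abd", 3) < 0);
    assert(traits::eq(u8"a"[0], u8"abc"[0]));
}
```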
<p>Change in 21.2.3 [char.traits.specializations]:
<blockquote>
<div style="margin-left: 1em;">
<tt>namespace std {</tt><br/>
<tt>template<> struct char_traits<char>;</tt><br/>
<ins><tt>template<> struct char_traits<char8_t>;</tt></ins><br/>
<tt>template<> struct char_traits<char16_t>;</tt><br/>
<tt>template<> struct char_traits<char32_t>;</tt><br/>
<tt>template<> struct char_traits<wchar_t>;</tt><br/>
<tt>}</tt><br/>