<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Charset Identity and Registration</title>
</head>
<body bgcolor="#FFFFFF">
<table border="0" cellspacing="0" cellpadding="4">
<tr>
<td>To:</td>
<td><a href="mailto:[email protected]">[email protected]</a></td>
</tr>
<tr>
<td>From:</td>
<td>Mark Davis, IBM</td>
</tr>
<tr>
<td>Re:</td>
<td>Charset Identity and Registration</td>
</tr>
<tr>
<td>Date:</td>
<td>2003-05-22</td>
</tr>
<tr>
<td>Latest Version</td>
<td><a href="http://source.icu-project.org/repos/icu/icuhtml/trunk/design/charset_questions.html">http://source.icu-project.org/repos/icu/icuhtml/trunk/design/charset_questions.html</a></td>
</tr>
</table>
<p>We have a number of questions about the application of <a href="http://www.ietf.org/rfc/rfc2978.txt">RFC
2978</a> that are important for resolving which charsets IBM should register. We
illustrate the issues with a list of named examples. Although these
examples are reduced to a handful of mappings for simplicity,
they call out the major issues that are important to clarify.</p>
<p><b>Notation</b></p>
<p>Each example consists of a mapping table for a possible charset. The octets
are on the left, with the corresponding Unicode/10646 code points on the right.
The arrows show where there is a mapping, using the following convention:</p>
<table border="1" width="100%" cellspacing="0" cellpadding="4">
<tr>
<th align="left">Notation</th>
<th align="left">Description (where B is an octet sequence and U is a
Unicode/10646 character sequence)</th>
</tr>
<tr>
<td align="center">B <=> U</td>
<td>A <i>roundtrip</i> mapping, where B maps to U and U maps back to B</td>
</tr>
<tr>
<td align="center">B <= U</td>
<td>A <i>fallback</i> mapping, where B does not map to U, but U maps back to
B</td>
</tr>
<tr>
<td align="center">B => U</td>
<td>A <i>reverse-fallback</i> mapping, where B maps to U, but U does not map
back to B</td>
</tr>
</table>
<p><b>Examples</b></p>
<ol>
<li>
<p>Base<br>
<code>0x61 <=> U+0061</code></p>
</li>
<li>
<p>Base_Fallback<br>
<code>0x61 <=> U+0061<br>
0x61 <= U+FFF1</code></p>
</li>
<li>
<p>Base_Reverse_Fallback<br>
<code>0x61 <=> U+0061<br>
0xF1 => U+0061</code></p>
</li>
<li>
<p>Base_Roundtrip1<br>
<code>0x61 <=> U+0061<br>
0xF1 <=> U+FFF1</code></p>
</li>
<li>
<p>Base_Roundtrip2<br>
<code>0x61 <=> U+0061<br>
0xF1 <=> U+00F1</code></p>
</li>
</ol>
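<p>As a minimal sketch of the notation, the Python below models two of the example tables and shows how a fallback and a reverse fallback behave in each direction. The names <code>Mapping</code>, <code>decode</code>, and <code>encode</code> are hypothetical and chosen only for illustration; they are not part of any registration or of ICU.</p>

```python
from typing import NamedTuple

class Mapping(NamedTuple):
    octet: int       # a single octet, for simplicity
    code_point: int  # Unicode/10646 code point
    kind: str        # "roundtrip", "fallback", or "reverse_fallback"

# Base_Fallback: 0x61 <=> U+0061, 0x61 <= U+FFF1
BASE_FALLBACK = [
    Mapping(0x61, 0x0061, "roundtrip"),
    Mapping(0x61, 0xFFF1, "fallback"),
]

# Base_Reverse_Fallback: 0x61 <=> U+0061, 0xF1 => U+0061
BASE_REVERSE_FALLBACK = [
    Mapping(0x61, 0x0061, "roundtrip"),
    Mapping(0xF1, 0x0061, "reverse_fallback"),
]

def decode(table, data: bytes) -> str:
    """Octets to characters: uses roundtrip and reverse-fallback mappings."""
    to_unicode = {m.octet: m.code_point for m in table if m.kind != "fallback"}
    # An unmapped octet raises KeyError in this sketch.
    return "".join(chr(to_unicode[b]) for b in data)

def encode(table, text: str) -> bytes:
    """Characters to octets: uses roundtrip and fallback mappings."""
    to_octet = {m.code_point: m.octet
                for m in table if m.kind != "reverse_fallback"}
    # An unmapped character raises KeyError in this sketch.
    return bytes(to_octet[ord(ch)] for ch in text)
```

<p>With these definitions, encoding U+FFF1 through <b>Base_Fallback</b> yields 0x61, but decoding 0x61 yields U+0061, so the fallback does not round-trip. Likewise, <b>Base_Reverse_Fallback</b> decodes 0xF1 to U+0061, but encodes U+0061 as 0x61.</p>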
<p>These examples are representative of a great many real character sets. The <a href="http://www.icu-project.org/charts/charset/">ICU
Character Set Repository</a> contains many cases where two charsets
differ only by additional fallbacks, reverse fallbacks, roundtrips, or a
combination of two or three of these. There are other examples in the <a href="http://www.w3c.org/TR/japanese-xml/">XML
Japanese Profile</a> and many others in IBM's database of code page information.</p>
<p>With these examples in mind, here are the questions:</p>
<p><i>Q1. If charset <b>Base</b> is already registered, which of the others is a
possible candidate for registration?</i></p>
<p> In the RFC, the operative language appears to be the following:</p>
<ol type="A">
<li>"The term "charset" (referred to as a "character
set" in previous versions of this document) is used here to refer to a
method of converting a sequence of octets into a sequence of
characters."</li>
<li>"A charset should therefore be registered ONLY if it adds significant
functionality that is valuable to a large community, OR if it documents
existing practice in a large community."</li>
<li>"Inclusion of a mapping to ISO 10646 is now recommended for all
registered charsets."</li>
</ol>
<p>From (A), a charset can be thought of as a set of mappings
from octets to characters. By that definition, <b>Base_Reverse_Fallback</b>, <b>Base_Roundtrip1</b>,
and <b>Base_Roundtrip2</b> are definitely not the same charset as <b>Base</b>,
because they map different sequences of octets. Thus, if they met
criterion (B), they would each be candidates for registration. From the text, the
situation is much less clear with <b>Base_Fallback</b>, which differs from <b>Base</b>
only in the character-to-octet direction.</p>
<p><i>Q2. If charset <b>Base_Roundtrip1</b> is registered, which of the others
is a candidate for registration?</i></p>
<p>It would appear that <b>Base_Roundtrip2</b> is clearly not the same charset,
and thus if it satisfied (B) it would be a candidate. But it is unclear whether <b>Base</b>
would be, since it would be a complete subset of a registered charset.
We need confirmation regarding the others as well.</p>
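<p>The reasoning in Q1 and Q2 can be sketched by looking only at the octet-to-character direction, which is all that definition (A) covers. The Python below is an illustrative assumption about how one might compare the five example tables; <code>DECODE</code>, <code>differs_from</code>, and <code>is_subset</code> are hypothetical names.</p>

```python
# Octet-to-character direction only, per definition (A).
DECODE = {
    "Base": {0x61: 0x0061},
    # The fallback adds only U+FFF1 -> 0x61, so this direction is unchanged.
    "Base_Fallback": {0x61: 0x0061},
    "Base_Reverse_Fallback": {0x61: 0x0061, 0xF1: 0x0061},
    "Base_Roundtrip1": {0x61: 0x0061, 0xF1: 0xFFF1},
    "Base_Roundtrip2": {0x61: 0x0061, 0xF1: 0x00F1},
}

def differs_from(registered: str) -> list:
    """Names of tables whose octet-to-character mappings differ from `registered`."""
    reference = DECODE[registered]
    return sorted(name for name, table in DECODE.items() if table != reference)

def is_subset(smaller: str, larger: str) -> bool:
    """True if every octet-to-character mapping of `smaller` also appears in `larger`."""
    return DECODE[smaller].items() <= DECODE[larger].items()
```

<p>Under this reading, <code>differs_from("Base")</code> lists <b>Base_Reverse_Fallback</b>, <b>Base_Roundtrip1</b>, and <b>Base_Roundtrip2</b> but not <b>Base_Fallback</b>, and <code>is_subset("Base", "Base_Roundtrip1")</code> is true, which is exactly the subset case raised in Q2.</p>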
<p><i>Q3. How does the RFC handle such additions over time?</i></p>
<p>This was very unclear to us. Assume again that <b>Base</b> is registered.
There are cases where someone <i>adds</i> mappings to <b>Base</b> (ending up
with something like <b>Base_Roundtrip1</b>), but does not register the result as a new
charset. For example, Microsoft added the Euro sign to windows-1252. Does this
require a new registration or not? This also relates to the issue of stability,
below.</p>
<p><i>Q4. Does the RFC require any sort of stability?</i></p>
<p>For example, there is a mapping supplied in the <a href="http://www.iana.org/assignments/charset-reg/windows-1252">windows-1252
registration</a>, pointing to <a href="http://www.microsoft.com/globaldev/reference/sbcs/1252.htm">Windows
Code Page 1252</a>, but the contents of the latter page may change over time
without notice; consulting that link now and a year from now might give
different answers, and one cannot tell whether or when changes were made in the
past. (This is not a criticism of Microsoft; this happens in many other cases.)
It appears that there is no requirement for stability in the RFC. Is this true?</p>
<blockquote>
<p>Note that when Microsoft added a mapping for Euro, they actually (e.g. in
terms of the results of their API) changed the mapping from 0x80 <=>
U+0080, to 0x80 <=> U+20AC. So on an API level, it was not an addition,
but a change of mapping, like from <b>Base_Roundtrip1</b> to <b>Base_Roundtrip2</b>.
However, since the registration didn't actually supply a mapping for 0x80,
formally speaking it was an addition, like from <b>Base</b> to <b>Base_Roundtrip2</b>.</p>
</blockquote>
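<p>As a small illustration of the 0x80 case, the Python below queries the <code>cp1252</code> codec that ships with Python's standard library. This shows only what that one implementation does today; it is not evidence of what the registration specifies or of when the mapping changed.</p>

```python
# What Python's bundled cp1252 codec does with octet 0x80.
euro = b"\x80".decode("cp1252")
octets = "\u20ac".encode("cp1252")

# U+0080 itself is not encodable in this codec's table.
try:
    "\u0080".encode("cp1252")
    c1_control_encodable = True
except UnicodeEncodeError:
    c1_control_encodable = False

# For contrast, ISO 8859-1 (latin-1) maps 0x80 to U+0080.
latin1_char = b"\x80".decode("latin-1")
```

<p>Here <code>euro</code> is U+20AC and <code>octets</code> is the single octet 0x80, while <code>latin1_char</code> is U+0080. A consumer of the charset name sees only such a snapshot, which is why the stability question matters.</p>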
<p><i>Q5. Can mapping tables be added to older registrations?</i></p>
<p>Many of the existing registrations do not have mapping tables supplied with
them, and in many of those cases it is difficult to get the original document
that defined the mapping. Should (or could) the original applicants supply
mapping tables for those? For example, should (or could) IBM supply mapping
tables for the "ibm-x" charsets that are already in the table, but do
not have mappings to and from Unicode/10646?</p>
<p><i>Q6. It would appear to us that there are two possible bright-line rules.
Which is the closest to IANA policy?</i></p>
<blockquote>
<p><i>R1. Two mapping tables get the same name if and only if all of their
roundtrip mappings are identical.</i></p>
<p><i>R2. Two mapping tables get the same name if and only if all of their
mappings (roundtrip, fallback, and reverse fallback) are the same.</i></p>
</blockquote>
<p>R2 is more complete, in the sense that any use of the two mapping tables with
the same IANA name will have identical results. However, it appears that this
policy would not be consistent with what has been practiced in the past. Thus we
are guessing that R1 would be the best bright-line rule. It would at least
guarantee that any mapping tables with the same IANA name <i>if used with
fallbacks and reverse fallbacks turned off</i> would have identical results.</p>
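<p>The two candidate rules can be sketched as predicates over mapping tables. This is a minimal illustration under the assumption that a table is represented as three sets of (octet, code point) pairs; <code>make_table</code>, <code>same_name_r1</code>, and <code>same_name_r2</code> are hypothetical names, not IANA or ICU terminology.</p>

```python
KINDS = ("roundtrip", "fallback", "reverse_fallback")

def make_table(roundtrip, fallback=(), reverse_fallback=()):
    """Build a table: each kind is a set of (octet, code point) pairs."""
    return {
        "roundtrip": frozenset(roundtrip),
        "fallback": frozenset(fallback),
        "reverse_fallback": frozenset(reverse_fallback),
    }

def same_name_r1(a, b) -> bool:
    # R1: only the roundtrip mappings must be identical.
    return a["roundtrip"] == b["roundtrip"]

def same_name_r2(a, b) -> bool:
    # R2: roundtrip, fallback, and reverse-fallback mappings must all match.
    return all(a[kind] == b[kind] for kind in KINDS)

BASE = make_table([(0x61, 0x0061)])
BASE_FALLBACK = make_table([(0x61, 0x0061)], fallback=[(0x61, 0xFFF1)])
BASE_ROUNDTRIP1 = make_table([(0x61, 0x0061), (0xF1, 0xFFF1)])
```

<p>Under R1, <b>Base</b> and <b>Base_Fallback</b> would share a name, since they differ only by a fallback; under R2 they would not. <b>Base</b> and <b>Base_Roundtrip1</b> get different names under either rule.</p>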
<h3>References:</h3>
<ul>
<li><a href="http://www.ietf.org/rfc/rfc2978.txt">RFC 2978</a>
<ul>
<li><a href="http://www.iana.org/assignments/charset-reg/index.htm">Character
Set Registrations</a></li>
<li><a href="http://www.iana.org/assignments/charset-info">Character Sets
Mailing List Information</a></li>
<li><a href="http://mail.apps.ietf.org/ietf/charsets/threads.html">Mail
Archive</a>
<ul>
<li>The archive on <a href="http://www.iana.org/assignments/charset-info">Character
Sets Mailing List Information</a> doesn't appear to work.</li>
<li>The <a href="http://lists.w3.org/Archives/Public/ietf-charsets/">W3C
Archive</a> works, but you have to wade through a lot of spam.</li>
</ul>
</li>
</ul>
</li>
<li><a href="http://www.w3c.org/TR/japanese-xml/">XML Japanese Profile</a></li>
<li><a href="http://www.iana.org/assignments/charset-reg/windows-1252">windows-1252
registration</a></li>
<li><a href="http://www.microsoft.com/globaldev/reference/sbcs/1252.htm">Windows
Code Page 1252</a></li>
<li><a href="http://www.icu-project.org/charts/charset/">ICU character
set repository</a></li>
<li><a href="http://www.unicode.org/reports/tr22/">UTR
#22: Character Mapping Tables</a></li>
</ul>
</body>
</html>