design/charset_questions.html

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Charset Identity and Registration</title>
</head>

<body bgcolor="#FFFFFF">

<table border="0" cellspacing="0" cellpadding="4">
  <tr>
    <td>To:</td>
    <td><a href="mailto:ietf-charsets@iana.org">ietf-charsets@iana.org</a></td>
  </tr>
  <tr>
    <td>From:</td>
    <td>Mark Davis, IBM</td>
  </tr>
  <tr>
    <td>Re:</td>
    <td>Charset Identity and Registration</td>
  </tr>
  <tr>
    <td>Date:</td>
    <td>2003-05-22</td>
  </tr>
  <tr>
    <td>Latest Version</td>
    <td><a href="http://source.icu-project.org/repos/icu/icuhtml/trunk/design/charset_questions.html">http://source.icu-project.org/repos/icu/icuhtml/trunk/design/charset_questions.html</a></td>
  </tr>
</table>
<p>We have a number of questions about the application of <a href="http://www.ietf.org/rfc/rfc2978.txt">RFC
2978</a> that are important for resolving which charsets IBM should register. We
will illustrate the issues with a list of named examples. Although these
examples are boiled down to a small set of very few mappings for simplicity,
they do call out the major issues that are important to clarify.</p>
<p><b>Notation</b></p>
<p>Each example consists of a mapping table for a possible charset. The octets
are on the left, with the corresponding Unicode/10646 code points on the right.
The arrows show where there is a mapping, using the following convention:</p>
<table border="1" width="100%" cellspacing="0" cellpadding="4">
  <tr>
    <th align="left">Notation</th>
    <th align="left">Description (where B is an octet sequence and U is a
      Unicode/10646 character sequence)</th>
  </tr>
  <tr>
    <td align="center">B &lt;=&gt; U</td>
    <td>A <i>roundtrip</i> mapping, where B maps to U and U maps back to B</td>
  </tr>
  <tr>
    <td align="center">B &lt;=&nbsp; U</td>
    <td>A <i>fallback</i> mapping, where B does not map to U, but U maps back to
      B</td>
  </tr>
  <tr>
    <td align="center">B&nbsp; =&gt; U</td>
    <td>A <i>reverse-fallback</i> mapping, where B maps to U, but U does not map
      back to B</td>
  </tr>
</table>
<p><b>Examples</b></p>
<ol>
  <li>
    <p>Base<br>
    <code>0x61 &lt;=&gt; U+0061</code></p>
  </li>
  <li>
    <p>Base_Fallback<br>
    <code>0x61 &lt;=&gt; U+0061<br>
    0x61 &lt;=&nbsp; U+FFF1</code></p>
  </li>
  <li>
    <p>Base_Reverse_Fallback<br>
    <code>0x61 &lt;=&gt; U+0061<br>
    0xF1&nbsp; =&gt; U+0061</code></p>
  </li>
  <li>
    <p>Base_Roundtrip1<br>
    <code>0x61 &lt;=&gt; U+0061<br>
    0xF1 &lt;=&gt; U+FFF1</code></p>
  </li>
  <li>
    <p>Base_Roundtrip2<br>
    <code>0x61 &lt;=&gt; U+0061<br>
    0xF1 &lt;=&gt; U+00F1</code></p>
  </li>
</ol>
<p>These examples illustrate a great many instances of character sets. There are
many such examples in the <a href="http://www.icu-project.org/charts/charset/">ICU
Character Set Repository</a>, which contains many cases where two charsets
differ only by additional fallbacks, reverse fallbacks, roundtrips, or a
combination of two or three of these. There are other examples at <a href="http://www.w3c.org/TR/japanese-xml/">XML
Japanese Profile</a> and many others in IBM's database of code page information.</p>
<p>With these examples in mind, here are the questions:</p>
<p><i>Q1. If charset <b>Base</b> is already registered, which of the others is a
possible candidate for registration?</i></p>
<p>&nbsp;In the RFC, the operative language appears to be the following:</p>
<ol type="A">
  <li>&quot;The term &quot;charset&quot; (referred to as a &quot;character
    set&quot; in previous versions of this document) is used here to refer to a
    method of converting a sequence of octets into a sequence of
    characters.&quot;</li>
  <li>&quot;A charset should therefore be registered ONLY if it adds significant
    functionality that is valuable to a large community, OR if it documents
    existing practice in a large community.&quot;</li>
  <li>&quot;Inclusion of a mapping to ISO 10646 is now recommended for all
    registered charsets.&quot;</li>
</ol>
<p>From&nbsp; (A), we have that a charset can be thought of as a set of mappings
from octets to characters. According to that, <b>Base_Reverse_Fallback</b>, <b>Base_Roundtrip1</b>,
and <b>Base_Roundtrip2</b> are definitely not the same charset as <b>Base</b>.
This is because they mapping different sequences of octets. Thus, if they met
criterion B, they would each be candidates for registration. From the text, the
situation is much less clear with <b>Base_Fallback</b>.</p>
<p><i>Q2. If charset <b>Base_Roundtrip1</b> is registered, which of the others
is a candidate for registration?</i></p>
<p>It would appear that <b>Base_Roundtrip2</b> is clearly not the same charset,
and thus if it satisfied B it would be a candidate. But it is unclear whether <b>Base</b>
would be or not, since it would be a complete subset of a registered charset.
And we need confirmation as to the others as well.</p>
<p><i>Q3. How does the RFC handle such additions over time?</i></p>
<p>This was very unclear to us. Assume again that <b>Base</b> is registered.
There are cases where someone <i>adds</i> mappings to a <b>Base</b> (ending up
with something like <b>Base_Roundtrip1</b>), but does not register it as a new
charset. For example, Microsoft added the Euro sign to windows-1252. Does this
require a new registration or not? It also relates to the issue of stability,
below.</p>
<p><i>Q4. Does the RFC require any sort of stability?</i></p>
<p>For example, there is a mapping supplied in <a href="http://www.iana.org/assignments/charset-reg/windows-1252">windows-1252
registration</a>, pointing to <a href="http://www.microsoft.com/globaldev/reference/sbcs/1252.htm">Windows
Code Page 1252</a>, but the contents of the latter page may change over time,
without notice; consulting that link now vs. a year from now might have
different answers, and one doesn't know when and if changes were made in the
past. (This is not a criticism of Microsoft; this happens in many other cases.)
It appears that there is no requirement for stability in the RFC. Is this true?</p>
<blockquote>
  <p>Note that when Microsoft added a mapping for Euro, they actually (e.g. in
  terms of the results of their API) changed the mapping from 0x80 &lt;=&gt;
  U+0080, to 0x80 &lt;=&gt; U+20AC. So on an API level, it was not an addition,
  but a change of mapping, like from <b>Base_Roundtrip1</b> to <b>Base_Roundtrip2</b>.
  However, since the registration didn't actually supply a mapping for 0x80,
  formally speaking it was an addition, like from <b>Base</b> to <b>Base_Roundtrip2</b>.</p>
</blockquote>
<p><i>Q5. Can mapping tables be added to older registrations?</i></p>
<p>Many of the existing registrations do not have mapping tables supplied with
them, and in many of those cases it is difficult to get the original document
that defined the mapping. Should (or could) the original applicants supply
mapping tables for those? For example, should (or could) IBM supply mapping
tables for the &quot;ibm-x&quot; charsets that are already in the table, but do
not have mappings to and from Unicode/10646?</p>
<p><i>Q6. It would appear to us that there are two possible bright-line rules.
Which is the closest to IANA policy?</i></p>
<blockquote>
  <p><i>R1. Two mapping tables get the same name if and only if all of their
  roundtrip mappings are identical.</i></p>
  <p><i>R2. Two mapping tables get the same name if and only if all of their
  mappings (roundtrip, fallback, and reverse fallback) are the same.</i></p>
</blockquote>
<p>R2 is more complete, in the sense that any use of the two mapping tables with
the same IANA name will have identical results. However, it appears that this
policy would not be consistent with what has been practiced in the past. Thus we
are guessing that R1 would be the best bright-line rule. It would at least
guarantee that any mapping tables with the same IANA name <i>if used with
fallbacks and reverse fallbacks turned off</i> would have identical results.</p>
<h3>References:</h3>
<ul>
  <li><a href="http://www.ietf.org/rfc/rfc2978.txt">RFC 2978</a>
    <ul>
      <li><a href="http://www.iana.org/assignments/charset-reg/index.htm">Character
        Set Registrations</a></li>
      <li><a href="http://www.iana.org/assignments/charset-info">Character Sets
        Mailing List Information</a></li>
      <li><a href="http://mail.apps.ietf.org/ietf/charsets/threads.html">Mail
        Archive</a>
        <ul>
          <li>The archive on <a href="http://www.iana.org/assignments/charset-info">Character
            Sets Mailing List Information</a> doesn't appear to work.</li>
          <li>The <a href="http://lists.w3.org/Archives/Public/ietf-charsets/">W3C
            Archive</a> works, but you have to wade through a lot of spam.</li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="http://www.w3c.org/TR/japanese-xml/">XML Japanese Profile</a></li>
  <li><a href="http://www.iana.org/assignments/charset-reg/windows-1252">windows-1252
    registration</a></li>
  <li><a href="http://www.microsoft.com/globaldev/reference/sbcs/1252.htm">Windows
    Code Page 1252</a></li>
  <li><a href="http://www.icu-project.org/charts/charset/">ICU character
    set repository</a></li>
  <li><a href="http://www.unicode.org/reports/tr22/"></a><a href="http://www.unicode.org/reports/tr22/">UTR
    #22: Character Mapping Tables</a></li>
</ul>

</body>

</html>