> > > And most (but not all) Unicode string implementations use UTF-16.
> > > Among languages and libraries that are very widely used, the majority
> > > is overwhelming: Java, Microsoft's CLR, Python, JavaScript, Qt,
> > > Xerces-C, and on and on.
> >
> > (...and Windows and Mac and IBM's ICU and PHP 6 and...)
> >
> They don't use UTF-16 by choice.
Are you sure about that, Alexander? Here's what the Unicode Consortium's
FAQ at http://unicode.org/faq/utf_bom.html says:
Q: Why are some people opposed to UTF-16?
"Even in East Asian text, the incidence of surrogate pairs should be well
less than 1% of all text storage on average."
Q: Should I use UTF-32 (or UCS-4) for storing Unicode strings in memory?
"These features were enough to swing industry to the side of using Unicode
(UTF-16). While a UTF-32 representation does make the programming model
somewhat simpler, the increased average storage size has real drawbacks,
making a complete transition to UTF-32 less compelling."
Q: How about using UTF-32 interfaces in my APIs?
"Except in some environments that store text as UTF-32 in memory, most
Unicode APIs are using UTF-16."
There are similar recommendations at
http://www.unicode.org/faq/programming.html:
Q: When would using UTF-16 be the right approach?
"If the APIs you are using, or plan to use, are UTF-16 based, which is the
typical case, then working with UTF-16 directly is likely your best bet.
Converting data for each individual call to an API is difficult and
inefficient, while working around the occasional character that takes two
16-bit code units in UTF-16 is not particularly difficult (and does not
have to be expensive)."
Q: How about converting to UTF-32?
"However, UTF-32 is often a poor choice for high performance, since (on
average) only half as many characters will fit your processor cache."
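To make the "not particularly difficult" part of that quote concrete, here
is a rough sketch in Java (one of the UTF-16 languages listed above) of
walking a string by scalar value instead of by code unit. The string
literal is just an invented example, not something from the FAQ:

  public class CodePointWalk {
      public static void main(String[] args) {
          // "G" + U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP, stored
          // as the surrogate pair D834 DD1E) + "clef"
          String s = "G\uD834\uDD1Eclef";

          System.out.println("code units:  " + s.length());                      // 7
          System.out.println("code points: " + s.codePointCount(0, s.length())); // 6

          // Advance by 1 or 2 code units depending on whether the current
          // code point is supplementary.
          for (int i = 0; i < s.length(); ) {
              int cp = s.codePointAt(i);
              System.out.printf("U+%04X%n", cp);
              i += Character.charCount(cp);
          }
      }
  }

The loop pays one extra comparison per character for the occasional
surrogate pair, while the in-memory size stays at two bytes per BMP
character instead of UTF-32's four.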
There are plenty of people who documented why they made the decisions they
did. They considered UTF-32 and rejected it. The reasoning is always the
same.
Here is a discussion by the ICU team of why they moved from UCS-2 to
UTF-16 (instead of UTF-32, etc.):
http://icu-project.org/docs/papers/surrogate_support_iuc17.ppt
Here is a discussion by someone who was involved in adding Unicode to
JavaScript:
http://icu-project.org/docs/papers/internationalization_support_for_javascript.html
Those documents are eight years old. We aren't leading this revolution,
we're following it.
=====
On the separate issue of whether string-ref should work with characters
(scalar values) or code units, there's also this at
http://unicode.org/faq/utf_bom.html:
Q: How about using UTF-32 interfaces in my APIs?
"However, while converting from such a UTF-16 code unit index to a
character index or vice versa is fairly straightforward, it does involve a
scan through the 16-bit units up to the index point. In a test run, for
example, accessing UTF-16 storage as characters, instead of code units
resulted in a 10× degradation. While there are some interesting
optimizations that can be performed, it will always be slower on average.
Therefore locating other boundaries, such as grapheme, word, line or
sentence boundaries proceeds directly from the code unit index, not
indirectly via an intermediate character code index."
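To illustrate why that conversion involves a scan (a sketch in Java, not
the test the FAQ ran): translating between the two kinds of index has to
walk the intervening code units.

  public class IndexConversion {
      public static void main(String[] args) {
          // Two supplementary characters (each a surrogate pair), then "abc".
          String s = "\uD834\uDD1E\uD834\uDD1Eabc";

          // Code-unit index 5 is the 'b'; finding its character index means
          // scanning the preceding code units for surrogate pairs.
          int unitIndex = 5;
          int charIndex = s.codePointCount(0, unitIndex);       // linear scan -> 3
          System.out.println("character index: " + charIndex);

          // Going the other way is a scan as well.
          int backToUnits = s.offsetByCodePoints(0, charIndex); // linear scan -> 5
          System.out.println("code-unit index: " + backToUnits);
      }
  }

Both calls are linear in the distance scanned, which is where that kind of
slowdown comes from; boundary analysis that already has a code-unit index
in hand avoids the extra pass.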
The JavaScript article above talks about that issue too. The author seems
to think that hiding code units doesn't make sense. He points out that,
for example, removing a character without removing its subsequent
combining characters produces garbage. He also points out that Korean and
Hindi syllables are usually thought of as single characters even though
Unicode represents them as sequences of individual characters; so again,
removing just one character from such a sequence produces garbage. As he
puts it: "This already requires
a higher-level facility than the ones normally used to count and index
characters in a string." Scalar values aren't at the top of the
abstraction tree; you can screw things up even at that level, so why hide
code units?
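The combining-character point is easy to reproduce. Here's a sketch in
Java (the strings are my own invented example, not from the article):
deleting a single scalar value can orphan the combining mark that followed
it.

  public class CombiningMark {
      public static void main(String[] args) {
          // "café" with the accent stored as a separate combining character:
          // c a f e U+0301 (COMBINING ACUTE ACCENT) -- five scalar values.
          String s = "cafe\u0301";

          // Remove the scalar value at index 3 (the base letter 'e').
          // No surrogates here, so code-unit and scalar-value indexing agree.
          String broken = new StringBuilder(s).deleteCharAt(3).toString();

          // Result is "caf" followed by a stray combining acute, which will
          // render as an accent over the 'f': garbage, even though a whole
          // scalar value (not half a surrogate pair) was removed.
          System.out.println(broken);
      }
  }

That's exactly the higher-level facility he's talking about: correct
editing needs grapheme-cluster boundaries, which neither code-unit nor
scalar-value indexing gives you by itself.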