[r6rs-discuss] Strings

From: MichaelL@frogware.com <MichaelL>
Date: Sun Mar 25 23:22:13 2007

> > Summary
> > "This document attempts to make the case that it is advantageous to
use
> > UTF-16 (or 16-bit Unicode strings) for text processing..."
>
> IMHO this is one of the worst mistakes Unicode is trying to make.
> It convinces people that they should not worry about characters above
> U+FFFF just because they are very rare. UTF-16 combines the worst
> aspects of UTF-8 and UTF-32.

No, that's wrong. Here's a direct quote from the document:

"Important: Supplementary code points must be supported for full Unicode
support, regardless of the encoding form. Many characters are assigned
supplementary code points, and even whole scripts are entirely encoded
outside of the BMP. The opportunity for optimization of 16-bit Unicode
string processing is that the most commonly used characters are stored
with single 16-bit code units, so that it is useful to concentrate
performance work on code paths for them, while also maintaining support
and reasonable performance for supplementary code points."
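
In other words: support supplementary characters correctly, but optimize
the single-code-unit (BMP) path that dominates real text. For the curious,
here is a minimal sketch of that shape in Java, whose strings are UTF-16.
(countCodePoints is just an illustrative helper of my own; the real
library call is String.codePointCount.)

    public final class CodePoints {
        // Count code points in a UTF-16 string. The common case (a BMP
        // character, one 16-bit unit) stays on the cheap path; a valid
        // high+low surrogate pair is consumed as one supplementary
        // code point.
        static int countCodePoints(String s) {
            int count = 0;
            int i = 0;
            while (i < s.length()) {
                char c = s.charAt(i);
                if (Character.isHighSurrogate(c)
                        && i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i += 2;  // supplementary code point: two 16-bit units
                } else {
                    i += 1;  // BMP code point (or unpaired surrogate)
                }
                count++;
            }
            return count;
        }

        public static void main(String[] args) {
            // "A" followed by U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP).
            String s = "A\uD834\uDD1E";
            System.out.println(s.length());          // 3 code units
            System.out.println(countCodePoints(s));  // 2 code points
        }
    }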

> If size is important and variable width of the representation of a code
> point is acceptable, then UTF-8 is usually a better choice. If O(1)
> indexing by code points is important, then UTF-32 is better. Nobody
> wants to process texts in terms of UTF-16 code units. Nobody wants to
> have surrogate processing sprinkled around the code, and thus if one
> accepts an API which extracts variable width characters, then the API
> could as well deal with UTF-8, which is better for interoperability.
> UTF-16 makes no sense.

No, that's wrong. I've provided links to many documents written by experts
with experience in the field. For example, Dr. Mark Davis is a co-founder
of Unicode, president of the Consortium, original architect of ICU, and
Chief Globalization Architect at IBM. Richard Gillam was a member of IBM's
Unicode Technology Group and an Engineer at the Unicode Technology Center
for Java Technology. He was also part of the team that added Unicode to
Java. People like Markus Scherer have similar backgrounds. Each of
those documents says the same thing: UTF-16 is the best overall trade-off
of space, time, and ease of use.
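
As an aside, the "surrogate processing" under discussion is a small,
fixed piece of arithmetic, not a per-script algorithm. A sketch, again
in Java, using U+1D11E as an arbitrary supplementary code point:

    public final class Surrogates {
        public static void main(String[] args) {
            // Encode supplementary code point U+1D11E as a UTF-16
            // surrogate pair.
            int cp = 0x1D11E;
            int offset = cp - 0x10000;                      // the 20 bits to split
            char high = (char) (0xD800 + (offset >> 10));   // 0xD834
            char low  = (char) (0xDC00 + (offset & 0x3FF)); // 0xDD1E

            // Decode the pair back to the original code point.
            int decoded = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

            // Prints: U+1D11E -> D834 DD1E -> U+1D11E
            System.out.printf("U+%X -> %04X %04X -> U+%X%n",
                    cp, (int) high, (int) low, decoded);
        }
    }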

But I'll tell you what. Find a document, written by someone with
substantial Unicode experience, that recommends UTF-32 as the best overall
in-memory encoding. I haven't found such a document, not a single one, but
maybe you can. (I mean that; maybe I wasn't searching in the right places
or with the right words.)