[r6rs-discuss] Strings from MichaelL_at_frogware.com on 2007-03-26 (r6rs-discuss.mbox)

From: MichaelL_at_frogware.com <MichaelL>
Date: Mon Mar 26 16:58:18 2007

> Jason Orendorff wrote:
>
> > And most (but not all) Unicode string implementations use UTF-16.
> > Among languages and libraries that are very widely used, the majority
> > is overwhelming: Java, Microsoft's CLR, Python, JavaScript, Qt,
> > Xerces-C, and on and on. (The few counterexamples use UTF-8: glib,
> > expat. And expat can be compiled to use UTF-16.)
> If this is true, then I would expect to find relatively little mention
> of UTF-8 compared to UTF-16 on the internet. However, the google test
> turns up *1,040,000* for *utf-16* versus *173,000,000* for *utf-8*.
> Now, of course I realize that this is a particularly crude technique for
> determining the relative popularity of UTF-8 and UTF-16, but even a very
> crude technique does not cause this much of a discrepancy. 173 : 1 is
> quite a steep ratio.
>
> I'm sure this all has a simple explanation, but if we're going to use
> popularity as a criterion for choosing a string representation, then we
> ought to be really sure that we've got that popularity lined up the
> right way around.
>
> Incidentally: *497,000* for *utf-32*.
>
> Furthermore, the IETF likes UTF-8 best. From the UTF-8 wikipedia page:
>
> The Internet Engineering Task Force (IETF) requires all Internet
> protocols to identify the encoding used for character data with UTF-8 as
> at least one supported encoding.

An encoding can be used in serialization and in memory. The encoding
itself is the same, but serialization has to worry about endianness and
signatures. (Memory formats are always in the endianness of the processor,
which means that endianness is known and the signature isn't required.)
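
To make the endianness and signature point concrete, here is a minimal
Python sketch. The sample string is my own choice for illustration; the
codec names and BOM constants are from Python's standard library.

```python
import codecs

text = "A\u20ac"  # "A" followed by the euro sign (illustrative sample)

# Serialized UTF-16 depends on byte order: same code units, different bytes.
le = text.encode("utf-16-le")  # little-endian, no signature
be = text.encode("utf-16-be")  # big-endian, no signature

# A signature (byte order mark) tells the reader which order was used.
signed_le = codecs.BOM_UTF16_LE + le
signed_be = codecs.BOM_UTF16_BE + be

# Python's generic "utf-16" decoder reads the signature and strips it.
from_le = signed_le.decode("utf-16")
from_be = signed_be.decode("utf-16")

# UTF-8 is a byte sequence, so there is only one possible serialization.
utf8 = text.encode("utf-8")
```

Both signed forms decode back to the same string even though their bytes
differ, while the UTF-8 form needs no signature at all.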

UTF-8 works very well as a serialization format since it has no endianness
issues and since it's reasonably compact. That's given it quite a boost,
at least for transmission and storage. Indeed, it's now the assumed default
for an increasing number of applications. But that doesn't necessarily
mean that it's the preferred memory format. That depends on the nature of
the application.

The most typical model I've seen is UTF-8 for serialization, UTF-16 for
processing, and UTF-32 for characters. So even a typical model uses all
three encodings.
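
As a rough illustration of how the three encodings differ for the same
text, the following Python sketch counts code units in each. The sample
string and the helper name code_units are hypothetical, chosen only for
this example.

```python
def code_units(s: str, encoding: str, width: int) -> int:
    """Count code units of `width` bytes in `s` encoded as `encoding`.

    Hypothetical helper for illustration only.
    """
    return len(s.encode(encoding)) // width

# a, e-acute, euro sign, musical G clef (the last is outside the BMP)
text = "a\u00e9\u20ac\U0001d11e"

utf8_units = code_units(text, "utf-8", 1)      # 1 + 2 + 3 + 4 bytes
utf16_units = code_units(text, "utf-16-le", 2) # G clef needs a surrogate pair
utf32_units = code_units(text, "utf-32-le", 4) # one unit per character

print(utf8_units, utf16_units, utf32_units)
```

Only in UTF-32 does the unit count equal the number of characters, which
is one reason it suits single characters, while the variable-width forms
trade that property for compactness.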
Received on Mon Mar 26 2007 - 16:57:12 UTC
