Jon Wilson wrote:
> Jason Orendorff wrote:
> > And most (but not all) Unicode string implementations use UTF-16.
> > Among languages and libraries that are very widely used, the majority
> > is overwhelming: Java, Microsoft's CLR, Python, JavaScript, Qt,
> > Xerces-C, and on and on. (The few counterexamples use UTF-8: glib,
> > expat. And expat can be compiled to use UTF-16.)
> If this is true, then I would expect to find relatively little mention
> of UTF-8 compared to UTF-16 on the internet. However, the Google test
> turns up *1,040,000* hits for *utf-16* versus *173,000,000* for *utf-8*.
> Now, of course I realize that this is a particularly crude technique for
> determining the relative popularity of UTF-8 and UTF-16, but even a very
> crude technique should not produce this large a discrepancy. Roughly
> 166 : 1 is quite a steep ratio.
By this reckoning, UTF-8 is more popular than Unicode, which only
gets 39,000,000 hits. Actually, according to Google, UTF-8 is more
popular than Jesus.
Incidentally, if you don't adjust for cluefulness, UTF-16 is more often
called "Unicode". Dreadful but true, especially in the Windows and
Java worlds. Bottom line: nobody else thinks about this stuff but
language designers and highly clueful library designers.
> The Internet Engineering Task Force (IETF) requires all Internet
> protocols to identify the encoding used for character data, with UTF-8
> as at least one supported encoding.
As a *transmission* format, UTF-8 is much more common than UTF-16,
for good reasons--but nowhere near as common as, say, Latin-1. In
other words, when doing I/O, a transcoding step is usually necessary
anyway.
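As a minimal sketch of that transcoding step, in Python (the sample
string and byte values are illustrative, not from the thread): bytes
arrive as Latin-1 off the wire, get decoded to the program's internal
string type, and are re-encoded as UTF-8 for output.

```python
# Incoming data in Latin-1: one byte per character, 'é' is 0xE9.
latin1_bytes = b"caf\xe9"

# Decode from the transmission encoding to an internal string.
text = latin1_bytes.decode("latin-1")   # "café"

# Re-encode for output; in UTF-8, 'é' becomes the two bytes 0xC3 0xA9.
utf8_bytes = text.encode("utf-8")       # b"caf\xc3\xa9"
```

The same decode/encode pair applies whatever the program's internal
representation is (UTF-16, UTF-8, or otherwise); the point is that I/O
in a different encoding forces a conversion at the boundary either way.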
-j
Received on Tue Mar 27 2007 - 09:01:12 UTC