On Sun, 25-03-2007 at 22:32 -0400, MichaelL_at_frogware.com wrote:
> "Important: Supplementary code points must be supported for full Unicode
> support, regardless of the encoding form.
That's the theory. But UTF-16 is strictly less convenient than UTF-32,
which means that a lot of code working in terms of UTF-16 doesn't bother
to support supplementary code points.
The C API for character predicates (iswalpha etc.) makes sense when
wchar_t is UTF-32; it can't support supplementary code points when
wchar_t is UTF-16.
The only advantages of UTF-16 over UTF-32 are memory usage and data
exchange with systems that already use UTF-16. *Nothing* in UTF-16 is
more convenient or simpler than UTF-32; it is purely an additional
layer of complexity.
> But I'll tell you what. Find a document, written by someone with
> substantial Unicode experience, that recommends UTF-32 as the best overall
> in-memory encoding.
C/C++ on Linux uses UTF-32 for wchar_t. Gtk+ uses UTF-8 internally.
Python can be compiled to use UTF-16 or UTF-32. Perl uses UTF-8. CLISP
uses code points in the API (the internal representation is a mixture
of UTF-32, UCS-2 and ISO-8859-1). iconv uses UTF-32 as its internal
encoding, which means that recoding to/from UTF-32 is faster than
recoding to/from UTF-16 or UTF-8.
I've never seen an external file encoded in UTF-16 or UTF-32 on Linux.
Practically the only Unicode encoding used for data exchange is UTF-8,
and UTF-32 is the primary temporary in-memory representation.
--
__("< Marcin Kowalczyk
\__/ qrczak_at_knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Received on Mon Mar 26 2007 - 07:07:01 UTC