[r6rs-discuss] Re: [Formal] formal comment (ports, characters, strings, Unicode)
> Is the following a valid summary of the issue?
>
> The existence of string-ref and string-set! operations seems to imply
> that a variable-length internal representation is not an option and
> a fixed-length representation wastes space and is therefore
> inefficient (mostly in an ASCII-centered world).
Mostly.
The one other consideration is the use of external libraries. Unicode is a
very big standard, and parts of it (like collation) are very complicated.
You really do not want to be writing your own implementation of the
Unicode Collation Algorithm.
Windows and Mac are both UTF-16. Java and .NET are both UTF-16. IBM's
ICU--an excellent open source, cross-platform, cross-language [C, C++,
Java] internationalization library--is UTF-16 (with increasing UTF-8
support). Linux (and, I believe, Solaris) uses UCS-4 for wide characters.
If you're serious about supporting Unicode you probably want good UTF-16
support. UCS-4 support out in the wild just isn't very good on most
platforms. (While it's supported on Linux, the implementation is bare
bones and produces some incorrect results.) But if you're serious about
supporting R5.92RS you're faced with a string-ref that makes UCS-4 the
easy path. By "easy" I don't just mean the implementation; I mean meeting
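the expectation that string-ref is O(1). With a variable-width encoding
such as UTF-8 or UTF-16, finding the nth character means scanning from the
start of the string, so if you don't meet that expectation your
performance on a lot of reasonable algorithms will be very poor.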
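To make the string-ref cost concrete, here is a minimal sketch in Python
rather than Scheme. It is not taken from any Scheme implementation, and the
two function names are made up for illustration. It contrasts looking up the
nth code point in a UTF-8 buffer (a linear scan) with the same lookup in a
fixed-width 32-bit buffer (one offset computation).

```python
def utf8_code_point_at(data: bytes, index: int) -> int:
    """Return the code point at position `index` in UTF-8 `data`.

    Hypothetical helper: it must walk the buffer from the start, so the
    cost grows with `index`.
    """
    if index < 0:
        raise IndexError(index)
    seen = -1
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte starts a new code point.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                end = offset + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return ord(data[offset:end].decode("utf-8"))
    raise IndexError(index)


def utf32_code_point_at(data: bytes, index: int) -> int:
    """Return the code point at `index` in little-endian UTF-32 `data`.

    Hypothetical helper: assumes no byte order mark. The position is a
    simple multiplication, independent of the string length.
    """
    start = index * 4
    if index < 0 or start + 4 > len(data):
        raise IndexError(index)
    return int.from_bytes(data[start:start + 4], "little")
```

A loop that calls the first function for every index of a long string does
quadratic work in total; the second stays linear, which is the behavior most
string-ref based algorithms quietly assume.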
Furthermore, while it's true that you can convert UCS-4 to UTF-16 without
loss, you probably don't want a system to do that silently each time it
performs a comparison while sorting 100,000 strings. (I'm assuming a
locale-aware comparison.) So in my opinion you want the encoding of
whatever Scheme you use to match the encoding of any library you expect to
use.
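A rough way to see the conversion overhead is to count conversions. The
Python sketch below is only an illustration under stated assumptions:
`to_utf16` is a hypothetical stand-in for handing a string to a UTF-16
based library, and comparing the encoded bytes is a placeholder for a real
locale-aware comparison, which it is not.

```python
import functools

conversions = 0  # number of times a string was converted


def to_utf16(text: str) -> bytes:
    """Hypothetical stand-in for converting a string for a UTF-16 library."""
    global conversions
    conversions += 1
    return text.encode("utf-16-be")


def compare_converting_each_time(a: str, b: str) -> int:
    # Converts both operands on every comparison, as a mismatched
    # internal encoding would force a system to do.
    key_a, key_b = to_utf16(a), to_utf16(b)
    return (key_a > key_b) - (key_a < key_b)


def sort_converting_each_time(strings):
    return sorted(strings, key=functools.cmp_to_key(compare_converting_each_time))


def sort_converting_once(strings):
    # Converts each string exactly once, up front.
    return sorted(strings, key=to_utf16)


def count_conversions(sort_function, strings):
    """Run `sort_function` and report (sorted list, conversions performed)."""
    global conversions
    conversions = 0
    result = sort_function(strings)
    return result, conversions
```

Sorting n strings needs at least n - 1 comparisons, so the per-comparison
variant performs at least 2(n - 1) conversions and typically far more, while
the convert-once variant performs exactly n. Matching encodings removes the
conversions altogether.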
> Unicode text encoded in any one of the formats can be converted to
> another without loss of information (right?).
Yes.
You left out one popular encoding, UCS-2. UCS-2 is a 16-bit encoding that
doesn't support surrogate pairs. That limits it to Unicode's Basic
Multilingual Plane. These days UCS-2 would probably be frowned on, but at
least with UCS-2 the code unit size matches the scalar value size for the
scalar values that UCS-2 supports. Gambit and Bigloo are two examples of
Scheme systems that support UCS-2, not UTF-16.
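As an illustration of both points (lossless conversion, and what UCS-2
cannot represent), here is a small Python sketch. The helper names are
hypothetical; the surrogate arithmetic follows the standard UTF-16
definition, and the round trip relies on Python's built-in codecs.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units for one Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Basic Multilingual Plane: one code unit, same as UCS-2.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]


def fits_in_ucs2(text: str) -> bool:
    """True if every character of `text` needs only one 16-bit code unit."""
    return all(len(utf16_code_units(ord(ch))) == 1 for ch in text)


def round_trips(text: str) -> bool:
    """Pass `text` through UTF-8, UTF-16 and UTF-32 and check nothing is lost."""
    current = text
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        current = current.encode(codec).decode(codec)
    return current == text
```

For example, U+1D11E (the musical G clef) becomes the surrogate pair D834
DD1E in UTF-16, so a string containing it round-trips through all three UTF
encodings but does not fit in UCS-2.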
> Moreover, the internal representation of strings does not have to
> match the external representation. For example,
> you can read a UTF-32 encoded file into a variable-length buffer to save
> some space (sometimes); or alternatively, you can read a UTF-8
> encoded file into a fixed-length buffer to save time on
> random-access (sometimes).
Yes.
On Linux, for example, UTF-8 is increasingly the default system
encoding--but Linux's wide-chars are UCS-4. Many of libc's string
operations--e.g., strcoll--will work directly on UTF-8 strings; others first
require conversion to UCS-4. (UCS-4 and UTF-32 both encode all Unicode
characters. UTF-32 has additional semantic expectations.)
> From what I understand, UTF-8, UTF-16, and UTF-32 are interchange
> formats.
These days UTF-8 is the overwhelming favorite for transmitting and storing
text, and is the assumed default of almost any new standard. I myself have
never seen anyone transmit or store UTF-32.
Received on Mon Mar 19 2007 - 22:23:16 UTC