[r6rs-discuss] Re: [Formal] formal comment (ports, characters, strings, Unicode)

From: William D Clinger <will>
Date: Mon Mar 19 16:06:31 2007

I am posting this as an individual member of the Scheme
community. I am not speaking for the R6RS editors, and
this message should not be confused with the editors'
eventual formal response.

Thomas Lord wrote:

> It would be equally multi-lingual to lock down CHAR as UTF-8
> code units, UTF-16 code units, scalar values, grapheme clusters.

That sentence appears to conflate three different things:
grapheme clusters, scalar values, and code units. I wouldn't
bother to point this out, except that ignoring those differences
creates confusion that runs like an eternal golden braid
(sorry, Doug!) through these discussions.

Grapheme clusters are particular (not arbitrary) sequences
of characters. (Yes, characters, according to the Unicode
glossary [1].)

Unicode scalar values are code points excluding surrogates
(and correspond pretty closely to the third meaning of
character given by the Unicode glossary [1]).

Code units (whether UTF-8, UTF-16, UTF-32, or whatever) are
bit patterns that are used to encode Unicode scalar values.
As programmers and as language designers, one of our guiding
principles is that bit patterns don't matter except where
they are forced upon us by the external world, typically via
i/o.
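
To make the distinction concrete, here is a small illustrative
sketch in Python, chosen only because its strings are indexed by
code point. The sample string and the helper name is_scalar_value
are assumptions made for illustration; nothing here is proposed
R6RS text.

```python
# Sample text (an assumption for illustration): "e" followed by
# U+0301 COMBINING ACUTE ACCENT, then U+1D11E MUSICAL SYMBOL G CLEF,
# which lies outside the Basic Multilingual Plane.
# A reader perceives two grapheme clusters here.
text = "e\u0301\U0001D11E"

def is_scalar_value(code_point: int) -> bool:
    """A scalar value is any code point that is not a surrogate."""
    return 0 <= code_point <= 0x10FFFF and not (0xD800 <= code_point <= 0xDFFF)

# Scalar values: what the characters are, independent of any encoding.
scalar_values = [ord(c) for c in text]

# UTF-8 code units: 8-bit patterns used to encode those scalar values.
utf8_units = list(text.encode("utf-8"))

# UTF-16 code units: 16-bit patterns used to encode the same scalar values.
utf16_bytes = text.encode("utf-16-be")
utf16_units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]

print(len(scalar_values))  # 3
print(len(utf8_units))     # 7
print(len(utf16_units))    # 4
```

The same text has three different lengths depending on which
interpretation is chosen (and a fourth, two, if counted in grapheme
clusters), which is why a program cannot be portable across
implementations that choose differently.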

> It would be equally multi-lingual to permit but not require any of
> those interpretations and to permit but not require extensions.

To permit any of those three incompatible interpretations
would be a disaster for portability. We have to pick one
interpretation, and stick with it.

> It
> would be better multi-lingual to permit extensions in areas that Unicode
> has specifically declined to pursue.

While I am much more sympathetic to extensions than you
would conclude by reading a draft R6RS, I am not very
sympathetic to fundamentally incoherent language design,
which is how I would describe any design that permitted
implementors to decide for themselves whether Scheme's
characters should correspond to grapheme clusters, scalar
values, or code units.

With regard to bucky bits, my conversations with users
and designers of Common Lisp have given me the impression
that most consider the original inclusion of bucky bits
in Common Lisp to have been a mistake; X3J13 relegated
them to implementation-dependent attributes, which are
more likely to get in the way of writing portable code
than to serve any portable purpose [2,3].

Finally, I'd like to note that the current draft R6RS
does not actually preclude bucky bits. An implementation
could add bucky bits to characters while making those
bits invisible to all of the standard operations on
characters and strings. An implementation-dependent
library could make those bits visible. (That would
mess up some programmers' mental model of eqv?, but
their model of eqv? is pretty messed up anyway. The
eq? and eqv? procedures have no special status apart
from constraints that are laid out by the report(s).)
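
As a rough sketch of that arrangement, assuming a hypothetical
representation (the names Char, char_eq, char_to_integer, and
bucky_bits are invented for illustration and written in Python,
not taken from any Scheme implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Char:
    scalar: int      # the Unicode scalar value
    _bucky: int = 0  # hypothetical extra bits, hidden from standard operations

# Analogues of standard operations: they look only at the scalar value.
def char_to_integer(c: Char) -> int:
    return c.scalar

def char_eq(a: Char, b: Char) -> bool:
    return a.scalar == b.scalar

# Analogue of an implementation-dependent library procedure
# that makes the extra bits visible.
def bucky_bits(c: Char) -> int:
    return c._bucky

plain = Char(0x61)
meta = Char(0x61, _bucky=0b01)

print(char_eq(plain, meta))  # True: the standard comparison ignores the bits
print(plain == meta)         # False: a finer-grained equivalence may not
```

In this sketch the dataclass equality stands in, as one possibility
only, for an equivalence predicate that can tell the two characters
apart even though the standard character comparison cannot.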

Will

[1] http://unicode.org/glossary/
[2] http://www.supelec.fr/docs/cltl/clm/node25.html#SECTION00624000000000000000
[3] http://www.lisp.org/HyperSpec/Issues/iss026-writeup.html
Received on Mon Mar 19 2007 - 16:06:09 UTC
