William D Clinger wrote:
> I am posting this as an individual member of the Scheme
> community. I am not speaking for the R6RS editors, and
> this message should not be confused with the editors'
> eventual formal response.
>
> Thomas Lord wrote:
>
>
>> It would be equally multi-lingual to lock down CHAR as UTF-8
>> code units, UTF-16 code units, scalar values, grapheme clusters.
>>
>
> That sentence appears to conflate three different things.
> I wouldn't bother to point this out, except that ignoring
> those differences creates confusion that runs like an
> eternal golden braid (sorry, Doug!) through these discussions.
>
It doesn't conflate things. If you start with R5, you could "support
Unicode" by identifying CHAR with any kind of code unit, or with
scalar values, or with grapheme clusters. For that matter, you could
also have a Scheme with more than one of those interpretations for
CHAR values, either keeping those sets disjoint or not. I am pointing
out *some* of the design space for Unicode support.
>> It would be equally multi-lingual to permit but not require any of
>> those interpretations and to permit but not require extensions.
>>
>
> To permit any of those three incompatible interpretations
> would be a disaster for portability. We have to pick one
> interpretation, and stick with it.
>
>
I think you are slightly mistaken there. How would you
characterize the design constraints on CHAR and STRING?
I would think that, as far as textual manipulation goes,
the *minimum* acceptable requirement is that older programs
which manipulate Scheme source should continue to work on
sources which use only the R5 portable character set.
This would be possible even if CHAR were defined to
correspond to UTF-8 code units (not a definition I
suggest for R6, but one that satisfies the constraint a priori).
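A quick sketch of why that weakest constraint is satisfied a
priori (assuming, purely for the sake of the example, that CHAR
denotes UTF-8 code units): every character in the R5 portable
character set is encoded as a single UTF-8 code unit whose
numeric value equals its scalar value, so over such sources the
two readings are indistinguishable.

    ;; Over the R5 portable character set, the code-unit reading
    ;; and the scalar-value reading coincide.
    (char->integer #\A)               ; => 65 under either reading
    (char=? #\A (integer->char 65))   ; => #t under either reading
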
A stronger constraint comes from extending the lexical
syntax of Scheme: the R5 definition of characters is such
that R5 programs which reflect non-trivially on source texts
are unlikely to work correctly with the extended lexical
syntax of 5.92. For example, the tests for identifier
constituent characters have changed. The atomic unit of
source texts in the lexical syntax is, reasonably, the
scalar value. So we need to revise the support for
source-text reflection in R6, and that revision should be
scalar-value-centric, *arguably* using the core CHAR/STRING
procedures for that purpose.
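To make the point concrete, here is roughly the kind of R5-era
reflection code I have in mind; it is only a toy sketch, with the
initial-character set copied from the R5 grammar. Under the
extended lexical syntax of 5.92 a test like this is too narrow.

    ;; An R5RS-style test for characters that may begin an
    ;; identifier. Programs built on tests like this misclassify
    ;; identifiers written in 5.92's extended syntax.
    (define (r5-initial? c)
      (or (char-alphabetic? c)
          (memv c '(#\! #\$ #\% #\& #\* #\/ #\: #\< #\= #\>
                    #\? #\^ #\_ #\~))))

    (r5-initial? #\f)   ; => #t
    (r5-initial? #\→)   ; => #f, though 5.92 admits such
                        ;    characters in identifiers
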
Some additional, obvious constraints are things like: CHAR=?
should be an equivalence relation, and STRINGs should, indeed,
be simple sequences of CHAR values. (I'm not so sure we
really need to say "finite" sequences.)
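In other words, nothing more exotic than behaviour like the
following, all of which is already available in R5:

    ;; STRINGs behave as simple sequences of CHAR values, and
    ;; CHAR=? is a proper equivalence relation over those values.
    (define s (string #\a #\b #\c))
    (char=? (string-ref s 0) #\a)                   ; => #t
    (equal? (string->list s) (list #\a #\b #\c))    ; => #t
    (string=? (list->string (string->list s)) s)    ; => #t
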
The draft goes far beyond satisfying those constraints and,
absent other essential constraints, would therefore seem to
contain the dreaded "arbitrary restrictions". The last refuge
against "arbitrary" is "practical": a design may be arbitrary,
but if all alternatives are equally arbitrary then the arbitrary
choice is a practical necessity. Which brings us to another
argument you make:
>> It
>> would be better multi-lingual to permit extensions in areas that Unicode
>> has specifically declined to pursue.
>>
>
> While I am much more sympathetic to extensions than you
> would conclude by reading a draft R6RS, I am not very
> sympathetic to fundamentally incoherent language design,
> which is how I would describe any design that permitted
> implementors to decide for themselves whether Scheme's
> characters should correspond to grapheme clusters, scalar
> values, or code units.
>
There need not be anything incoherent about it: you are overstating.
Let's suppose we agree (we do, I think) that a good design
will allow a portable Scheme source text to be represented
as a STRING of CHAR values which correspond to scalar
values. That's certainly economical and it allows for libraries
that do more sophisticated Unicode textual processing.
It doesn't follow from that that implementations ought not to
have, to pick one example, CHAR values that are (at least
sometimes) treated as UTF-8 code units. One way to do it is to
make those additional characters disjoint from the scalar
values: the code unit externally numbered #x41 would be distinct
from (integer->char #x41). Another way to do it is to identify
similarly numbered code units and scalar values, biasing the
standard CHAR and STRING procedures to always
prefer the scalar value interpretation, where it makes a
difference. In the former case, a STRING-SET! that places
a UTF-8 code unit into a string would *always* result in
a string that does not correspond to any Unicode text (for
the purposes of, say, DISPLAY). In the latter case,
there would be a need for some other procedure -- say
UTF-8-STRING-SET! -- because STRING-SET! could
never be used to store a UTF-8 code unit in a string.
(The latter case is close to but not quite the same as
something you go on to suggest at the end; see below.)
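To sketch the former, disjoint, case (UTF8-CODE-UNIT->CHAR is an
invented name used only for illustration; nothing here is meant
as a concrete proposal):

    ;; Hypothetical constructor yielding code-unit characters
    ;; disjoint from the scalar-value characters.
    (define cu-A (utf8-code-unit->char #x41))  ; hypothetical
    (define sv-A (integer->char #x41))         ; ordinary scalar value

    (char=? cu-A sv-A)    ; => #f: the two #x41s are distinct CHARs

    ;; Storing a code unit via STRING-SET! always yields a string
    ;; that corresponds to no Unicode text (e.g. for DISPLAY).
    (define str (make-string 3 #\x))
    (string-set! str 0 cu-A)
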
> With regard to Bucky bits, my conversations with users
> and designers of Common Lisp have given the impression
> that most consider the original inclusion of bucky bits
> in Common Lisp to have been a mistake; X3J13 relegated
> them to implementation-dependent attributes, which are
> more likely to get in the way of writing portable code
> than to serve any portable purpose [2,3].
>
I have not suggested, and do not suggest, that R6 should
standardize bucky bits.
> Finally, I'd like to note that the current draft R6RS
> does not actually preclude bucky bits. An implementation
> could add bucky bits to characters while making those
> bits invisible to all of the standard operations on
> characters and strings. An implementation-dependent
> library could make those bits visible. (That would
> mess up some programmers' mental model of eqv?, but
> their model of eqv? is pretty messed up anyway. The
> eq? and eqv? procedures have no special status apart
> from constraints that are laid out by the report(s).)
>
>
I'd given some thought to that. You could describe it by saying
that CHAR->INTEGER *must* be total over CHAR and *may* be a
many-to-one function, while INTEGER->CHAR need be non-divergent
only for numeric scalar values and, over that domain, *must* be
an injection into CHAR. The language of 5.92 seems to me to
explicitly forbid this interpretation, but you have brought it
into consideration, so let's see where it goes.
It is an interesting idea except that, in general, it needlessly
leaves you with no coherent account of the standard CHAR
ordering predicates and their induced ordering of STRINGs
regarded as simple sequences of CHAR. It would leave Scheme
with no generic sequence-of-CHAR type nor any way to write
portable libraries for such a type.
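Spelled out, the reading you are allowing looks roughly like
this, where META-CHAR is an invented, implementation-dependent
constructor exposed only by a non-standard library:

    (define meta-a (meta-char #\a))   ; hypothetical, non-standard
    (char->integer meta-a)            ; => 97, same as for #\a, so
                                      ;    CHAR->INTEGER is many-to-one
    (integer->char 97)                ; => #\a, never meta-a, so
                                      ;    INTEGER->CHAR remains an
                                      ;    injection over scalar values
    (char<? meta-a #\b)               ; ...and what should this return?
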
As a matter of opinion, I think that the "maximally extended"
implementation, of which real implementations could be
regarded as various approximations, would have a lattice
of character-like values with CHAR at the top and various kinds
of code unit, scalar values, non-scalar grapheme clusters, and
non-standard character values below that (all disjoint from
one another). In that context, I'd prefer STRING-SET! and
similar procedures to be total over all CHAR values. Naturally,
therefore, CHAR would be only partially ordered. CHAR<->INTEGER
could not reasonably be total over CHAR and would (especially
for a simple approach to reflecting on source texts) have to be
total over scalar values (with their ordinary numbering).
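Concretely, and again with invented names for the hypothetical
code-unit constructor, the picture I have in mind is roughly:

    ;; Restricted to scalar values, CHAR<? behaves as usual...
    (char<? #\a #\b)                            ; => #t
    ;; ...but across disjoint kinds of character-like values the
    ;; order is only partial; one choice is #f in both directions.
    (char<? (utf8-code-unit->char #x80) #\a)    ; could be #f
    (char<? #\a (utf8-code-unit->char #x80))    ; and also #f
    ;; STRING-SET! is total over all CHAR values:
    (string-set! (make-string 1 #\x) 0 (utf8-code-unit->char #x80))
    ;; CHAR->INTEGER must keep the ordinary numbering over scalar
    ;; values but need not be defined over every CHAR:
    (char->integer #\a)                         ; => 97
    ;; (char->integer (utf8-code-unit->char #x80)) might be an error
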
Finally, I think there is a debate implicit in this explicit debate.
It seems to me that you and Cowan would like Scheme programs
to be type-safe in an unprecedented way that goes beyond the
type safety goals of the report: We'd all agree that the report
has the job of defining fundamental type domains over which
non-divergent programs have a well-defined meaning. Where
we split is that I think the notations that comprise a Scheme (or
really, any lisp) program denote a class of meanings which
includes any reasonably constructive extension over the explicitly
defined domains. Therefore, when the report talks about
totality (over domains, of orders, etc.) I think it *should* only
be discussing the reliable portable interpretations of programs,
not imposing a constraint on implementations.
My more radical (or is it more traditional?) interpretation,
contrasted with the one I find in the 5.92 language,
makes it impossible to prove that many programs are never
divergent under a conforming implementation (depending
on, for example, their inputs). Yes, well, that's one of the
distinctions between the dynamically, extensibly typed lisp
tradition and languages that are based more closely on
typed lambda calculi.
-t
> Will
>
> [1] http://unicode.org/glossary/
> [2] http://www.supelec.fr/docs/cltl/clm/node25.html#SECTION00624000000000000000
> [3] http://www.lisp.org/HyperSpec/Issues/iss026-writeup.html
>
>