[r6rs-discuss] [Formal] Scheme should not be changed to be case sensitive.
Matthew Flatt wrote:
Tom> R^_RS is getting this all wrong by trying to, arrogantly,
Tom> carve out some entirely *novel* definition of CHAR?.
Tom> [....]
Matthew> I think you're mistaken about the consortium's
Matthew> recommendation. [....]
Unicode is a finite construction out of a small number of finite
sets (e.g., codepoints, character classes) and relations (e.g.,
the character class of codepoint 97 is "L"). Properties
of sequences of some of those values are defined. The technical
reports define sequence->sequence mappings, comprising a
basic text-string algebra, that preserve various invariants
(e.g., "a downcased string contains no upper-case letters").
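To make the first of those relations concrete, here is roughly how
it shows up in draft-style Scheme terms (a sketch using the proposed
Unicode procedures, purely for illustration):

    (char->integer #\a)                         ; => 97, the codepoint relation
    (char-general-category (integer->char 97))  ; => Ll, whose major class is "L"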
There's a kind of semi-lattice formed by the Unicode definitions:
treat each definition as a point in the semi-lattice and the
direct dependencies between definitions as the partial order.
For example, the definitions of character class, codepoint,
and the relationship "character class of" form:
            char_class_of
            /            \
   codepoints            character classes
            \            /
              bottom
That semi-lattice displays the partial order among the logical
definitions of various Unicode terms. Extend that semi-lattice far
enough and you'll have nodes for things like the definition of
downcasing a string.
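One of those higher nodes, again in draft-style Scheme terms (a
sketch, not text from any draft): the string case mappings realize
the invariant quoted above.

    (string-downcase "STRASSE")   ; => "strasse" -- no upper-case letters remain
    (string-upcase "Straße")      ; => "STRASSE" -- a full, not per-character, mapping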
Well, imagine sitting down with the Unicode standard and writing
out a definition semi-lattice that describes it -- your goal is to
make a fairly minimal semi-lattice whose structure reflects the
structure of the standard in as simple a way as possible. This
is a trivial exercise (e.g., the definition of the character
class mapping depends (for its domain) on the definition of
a codepoint).
R^_RS is looking for a node on your semi-lattice that represents the
set of codepoints, excluding surrogates:
          utf-16-able
    (codepoints - surrogates)
        /             \
 codepoints           surrogates
The set "utf-16-able" is the domain of INTEGER->CHAR, for
example. It is also part of the domain of STRING-SET! and
part of the range of STRING-REF.
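To make that concrete, here is a sketch of what that choice of
domain means at the read-eval-print loop (the surrogate behaviour is
the point; the exact condition signalled is not):

    (integer->char #x61)      ; => #\a        a codepoint, so utf-16-able
    (integer->char #x1D11E)   ; => #\x1D11E   outside the BMP, still utf-16-able
    (integer->char #xD800)    ; error: #xD800 is a surrogate, hence not
                              ;        utf-16-able, hence outside the domain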
The programming language design goal can be understood
as locating the language primitives on that semi-lattice, possibly
extending the lattice to include them.
For example, the current draft locates CHAR?:
     CHAR? == utf-16-able
    (codepoints - surrogates)
        /             \
 codepoints           surrogates
You won't find many Unicode definitions that sit above that
on your semi-lattice:
  utf16              (not much else)
       \              /
     CHAR? == utf-16-able
    (codepoints - surrogates)
        /             \
 codepoints           surrogates
So, if CHAR? is "utf-16-able", you don't get much of Unicode
that follows directly from that. Instead, you're going to have
to make up new definitions.
For example, a Unicode technical report gives you a definition of:
codepoint_casemapping (C) => C
That's found on the definition semi-lattice but not where you
want it -- it has the wrong domain in the Unicode standards.
You have to define a new thing:
  utf16              utf_16able_codepoint_casemapping
       \              /
     CHAR? == utf-16-able
    (codepoints - surrogates)
        /             \
 codepoints           surrogates
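As a sketch of what that new definition amounts to (here
UNICODE-SIMPLE-UPCASE is an invented name standing for the technical
report's codepoint->codepoint case mapping; it is not a procedure in
any draft):

    ;; The restricted mapping: the TR's rule, on a smaller domain.
    (define (utf-16able-upcase c)        ; c ranges over CHAR?, i.e. over
      (integer->char                     ; codepoints minus surrogates
        (unicode-simple-upcase (char->integer c))))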
Your new (restricted) codepoint mapping contains less
information than the real deal. More to the point, the CHAR?
and STRING? primitives in your language are now off in some
newly constructed definition semi-lattice that *resembles* but
is not equal to the Unicode definition semi-lattice. This
should at least begin to freak you out that, perhaps, your
design has gone off track. It isn't a proof that anything is
wrong but it is a good hint. Perhaps there should be some
formal proofs about how the space of strings, so defined,
compares to the space of strings of abstract characters as
recognized by the Unicode consortium.
Towards the top of the Unicode definition lattice are
definitions that ascribe semantics to Unicode strings:
mappings from encoded Unicode to abstract sequences of
abstract characters. These definitions point out a bug
in Unicode.
Yes, there is a bug, in Unicode. Sets of coding values (e.g.,
16-bit integers for UTF-16) are defined, along with sequences
of these. The bug is that the semantic mappings of such
sequences to abstractions like "sequence of codepoints" or
"sequence of abstract characters" are partial and, indeed,
change over time (as new codepoints are assigned). This
is a very serious bug and is the root of most of the
confusion and mess which is Unicode-in-the-real-world.
The bug is reflected when a programmer is writing some function
and wondering "what should happen if the input data is
'ill-formed'?" and, most often, finding no good answer.
The bug is easy enough to fix but, instead, the R^_RS draft opts
for a work-around. Of all of the questions about "ill-formed
sequences" that the Unicode bug gives rise to, the R^_RS draft
decides "The most important class of ill-formed sequences are
those that can not be represented in UTF-16." Then, it uses the
definition "utf-16-able" to (forcibly) prevent programmers from
creating any strings that contain such sequences.
That work-around is not an insane choice for an implementor.
It's perfectly well defined and an implementation that uses that
work-around will be quite useful.
Nevertheless, that workaround is not the *only* choice and it is
not even a *necessary* choice. It is helpful to compare that
work-around to another choice: to fix the bug.
Let us suppose that we define two new encoding forms:
"real-utf-8" and "real-utf-16". These new encoding forms
have the property that they can encode *any* sequence of 32-bit
values, including sequences that contain unpaired surrogates
and isolated UTF-8 trailing code values (e.g., 0x81).
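A sketch of what "real-utf-8" could look like for a single code
value: the ordinary UTF-8 bit patterns, minus the rule that rejects
surrogates. Only the 1- through 4-byte patterns are shown; covering
the full 32-bit range would also need the old 5- and 6-byte patterns.
(The bitwise operators are the ones from (rnrs arithmetic bitwise).)

    (define (real-utf-8-encode n)     ; assumes 0 <= n < #x200000
      (define (cont shift)            ; a 10xxxxxx continuation byte
        (bitwise-ior #x80 (bitwise-and #x3F
                            (bitwise-arithmetic-shift-right n shift))))
      (cond ((< n #x80)    (list n))
            ((< n #x800)   (list (bitwise-ior #xC0 (bitwise-arithmetic-shift-right n 6))
                                 (cont 0)))
            ((< n #x10000) (list (bitwise-ior #xE0 (bitwise-arithmetic-shift-right n 12))
                                 (cont 6) (cont 0)))
            (else          (list (bitwise-ior #xF0 (bitwise-arithmetic-shift-right n 18))
                                 (cont 12) (cont 6) (cont 0)))))

    (real-utf-8-encode #xD800)        ; => (#xED #xA0 #x80): an unpaired
                                      ;    surrogate encodes like anything else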
I find that to be a very appealing model: CHAR? is just a
32-bit value with a disjoint type tag. STRING? is just
a sequence of CHAR?. This reduces the number of types
needed to define Unicode and, most importantly, makes it
easier to make all character and string mappings total
functions rather than partial functions.
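A minimal sketch of that model (the names wide-char and wide-string
are invented for illustration; a real implementation would pack the
tag and value far more efficiently, and would range-check the value):

    (define-record-type wide-char      ; a disjoint type tag...
      (fields value))                  ; ...wrapping a 32-bit value
    (define (wide-string . chars)      ; a string is just a sequence of chars
      (list->vector chars))

    (wide-char-value (make-wide-char #xD800))   ; => #xD800: a lone surrogate
                                                ;    is an ordinary value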
The main *cost* of that model is that if I try to DISPLAY a
string on a port that only emits, say, UTF-8, then I have to be
prepared for a new kind of run-time error: trying to write a
string on a port that can't handle it. Your situation may be
different but, for my needs, I think that this is a fine
trade-off, especially because it lets me write programs that
deal with ill-formed Unicode in a rational way.
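In terms of the R6RS-style i/o condition types, that new kind of
run-time error already has a natural shape (STR and OUT below are
placeholders for a possibly ill-formed string and a UTF-8-only port):

    (guard (e ((i/o-encoding-error? e)
               (display "string not encodable on this port\n"
                        (current-error-port))))
      (display str out))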
Well, if I define CHAR? and STRING? that way then the relation
between my language definition and the definition semi-lattice
of Unicode is greatly simplified -- I'm no longer rooting all
of my language definitions at the domain "utf-16-able"; I'm
back on the original lattice saying "CHAR? == codepoint", etc.
The resulting programming language has sharp edges, to be sure.
Programs will have no difficulty at all creating ill-formed
strings if that is what they want to do, therefore there will be
a new class of bugs related to accidentally ill-formed strings.
I don't think I would want Javascript to have these semantics,
for example. I'm pretty sure (not certain) I would not want
Emacs Lisp to have these semantics. I see both sides -- but I
would just like my sharp-edged language to be *permitted* by
R^_RS.
I would be happier with an R^_RS that simply requires CHAR? to
be isomorphic to an implementation-defined subset of codepoints
which includes some mandatory members. In other words, the
domain of INTEGER->CHAR would include surrogate codepoints
but, for those codepoints, a conforming Scheme could either
return a CHAR? or signal an error.
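Concretely, under that looser requirement both of the following
behaviours would be conforming (a sketch of the two permitted
outcomes, not proposed report text):

    (integer->char #xD800)
    ;; implementation A: => #\xD800        its CHAR? covers surrogates
    ;; implementation B: signals an error  its CHAR? excludes them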
-t
Received on Wed Nov 15 2006 - 13:55:16 UTC