[r6rs-discuss] [Formal] Scheme should not be changed to be case sensitive.

From: John Cowan <cowan>
Date: Wed Nov 15 15:23:29 2006

Thomas Lord scripsit:

> R6RS is looking for a node on your semi-lattice that represents the
> set of codepoints, excluding surrogates:

That domain exists and is defined, and its name is "Unicode scalar values".

> For example, a Unicode technical report gives you a definition of:

> codepoint_casemapping (C) => C

> That's found on the definition semi-lattice but not where you want it --
> it has the wrong domain in the Unicode standards.

Actually, it isn't. Case mapping is done on characters, not on code
points: see Section 4.2 of TUS 4.0. In effect, then, case mappings
are only available for Unicode scalar values. A few properties, such
as General Category, are defined on code points, precisely so that
the standard can assign a GC of Cs to the surrogate code points; but
that is exceptional.
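
To make that concrete: in a system whose characters are exactly the
scalar values (as in the R5.91RS draft), case mapping is total on
CHAR?, and the surrogate question cannot even arise. A sketch,
assuming the draft's semantics for INTEGER->CHAR:

    (char-upcase #\a)        ; => #\A -- defined for every character
    (integer->char #xD800)   ; error: #xD800 is not a scalar value, so
                             ; there is no character, and no case-mapping
                             ; question, for it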

> Yes, there is a bug, in Unicode. Sets of coding values (e.g.,
> 16-bit integers for UTF-16) are defined, along with sequences of these.
> The bug is that the semantic mappings of such sequences to abstractions
> like "sequence of codepoints" or "sequence of abstract characters" are
> partial and, indeed, change over time (as new codepoints are assigned).

A sequence of code units (8-bit, 16-bit, or 32-bit integers) is mapped
to a sequence of Unicode scalar values by UTF-8, UTF-16, or UTF-32
respectively. That mapping does not change. Surrogate code points are
not representable in any of these.
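
The UTF-16 half of that mapping, for instance, is fixed arithmetic;
a sketch (the procedure name is mine, not the report's):

    ;; scalar = #x10000 + (hi - #xD800) * #x400 + (lo - #xDC00)
    (define (surrogate-pair->scalar hi lo)
      (+ #x10000
         (* (- hi #xD800) #x400)
         (- lo #xDC00)))

    (surrogate-pair->scalar #xD835 #xDD4A)   ; => #x1D54A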

It's true that new abstract characters are recognized and assigned to
hitherto-unused scalar values over time. This is neither a bug nor a
feature, but a consequence of the Consortium's lack of omniscience and
its inability to find money to pay for everything all at once.

> The bug is reflected when a programmer is writing some function and
> wondering "what should happen if the input data is 'ill-formed'?" and,
> most often, finding no good answer.

Byte-encoded data may be ill-formed. Every sequence of Unicode scalar
values is well-formed. Consequently, every R5.91RS string is well-formed.
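
Put another way, any sequence of scalar values whatever makes a
string, so there is nothing to validate; a sketch, again assuming
the draft's semantics:

    (list->string
     (map integer->char '(#x61 #xD7FF #xE000 #x10FFFF)))
    ;; => a four-character string; no choice of scalar values can
    ;;    make it "ill-formed"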

> The bug is easy enough to fix but, instead, the R6RS draft opts for
> a work-around. Of all of the questions about "ill-formed sequences"
> that the Unicode bug gives rise to, the R6RS draft decides "The most
> important class of ill-formed sequences are those that can not be
> represented in UTF-16." Then, it uses the definition "utf-16-able" to
> (forcibly) prevent programmers from creating any strings that contain
> such sequences.

The sequences in question cannot be represented in *any* well-formed
Unicode encoding. There is a historical connection with UTF-16 here,
but it is clear that none of the seven standard encoding schemes (nor
any other known to me) can represent surrogate code points.

> Let us suppose that we define two new encoding forms: "real-utf-8"
> and "real-utf-16".

This was the meaning of my remark about reinventing Unicode.
The Unicode scalar value space is what it is: the hex ranges
0000-D7FF and E000-10FFFF. The three code-unit encoding forms and
the seven byte encoding schemes are likewise fixed.
Adding new ones helps nobody.
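
Stated as a predicate, that space is just (the name is mine; the
ranges are transcribed from above):

    (define (unicode-scalar-value? n)
      (and (integer? n)
           (exact? n)
           (or (<= 0 n #xD7FF)
               (<= #xE000 n #x10FFFF))))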

(Unicode very reluctantly defined an encoding called "CESU-8" so that
there would be a label for the buggy not-quite-UTF-8 implemented in
Oracle. Java also uses it when serializing strings, with one further
deviation: U+0000 is encoded as the overlong pair #xC0 #x80, hence
"Modified UTF-8". It's extremely special-purpose.)
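
Concretely, CESU-8's deviation is that a supplementary character is
encoded by UTF-8-encoding each half of its UTF-16 surrogate pair,
six bytes where real UTF-8 uses four; a sketch (the helper is
hypothetical, not part of any report or of Unicode):

    (import (rnrs))

    (define (cesu-8-encode sv)          ; sv in #x10000..#x10FFFF
      (define (three-byte n)            ; 3-byte UTF-8 pattern for n
        (list (+ #xE0 (div n #x1000))
              (+ #x80 (mod (div n #x40) #x40))
              (+ #x80 (mod n #x40))))
      (let ((u (- sv #x10000)))
        (append (three-byte (+ #xD800 (div u #x400)))
                (three-byte (+ #xDC00 (mod u #x400))))))

    (cesu-8-encode #x10400)
    ;; => (#xED #xA0 #x81 #xED #xB0 #x80) -- real UTF-8 is
    ;;    #xF0 #x90 #x90 #x80, and a conforming UTF-8 decoder
    ;;    must reject the CESU-8 bytes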

> I find that to be a very appealing model: CHAR? is just a 32-bit
> value with a disjoint type tag. STRING? is just a sequence of CHAR?.

Doubtless it is appealing. It's also deeply misleading.

> The main *cost* of that model is that if I try to DISPLAY a string on
> a port that only emits, say, UTF-8, then I have to be prepared for a
> new kind of run-time error: trying to write a string on a port that
> can't handle it. Your situation may be different but, for my needs,
> I think that this is a fine trade-off, especially because it lets me
> write programs that deal with ill-formed Unicode in a rational way.

Experience tends to show that there isn't much to do when encoded
input is ill-formed, other than to reject it or to mutate the
offending octets into something marginally sensible, typically
U+FFFD, the REPLACEMENT CHARACTER.
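
Both policies are expressible directly in the draft's I/O layer; a
sketch using the transcoder names as they appear in the R6RS drafts
(the ill-formed bytes are my example):

    (import (rnrs))

    ;; #xC3 is a two-byte lead, but #x28 is not a continuation byte,
    ;; so this is ill-formed UTF-8.
    (define bad (u8-list->bytevector '(#x61 #xC3 #x28 #x62)))

    ;; Reject the input:
    (bytevector->string
     bad
     (make-transcoder (utf-8-codec) (eol-style none)
                      (error-handling-mode raise)))
    ;; raises an &i/o-decoding condition

    ;; Mutate the offending octet to U+FFFD and carry on:
    (bytevector->string
     bad
     (make-transcoder (utf-8-codec) (eol-style none)
                      (error-handling-mode replace)))
    ;; => typically the four characters a, U+FFFD, (, b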

> Really? Here is a port which *might* contain some valid XML.
> How will you interpret what you read from it?

As XML. If it's not XML, I won't interpret it: the XML Rec requires
that XML processors MUST NOT interpret input that is not well-formed
XML (which entails being well-formed with respect to the declared or
implied encoding).

-- 
John Cowan      http://www.ccil.org/~cowan      cowan_at_ccil.org
Be yourself.  Especially do not feign a working knowledge of RDF where
no such knowledge exists.  Neither be cynical about RELAX NG; for in
the face of all aridity and disenchantment in the world of markup,
James Clark is as perennial as the grass.  --DeXiderata, Sean McGrath