[r6rs-discuss] unicode (re comment #134) from Thomas Lord on 2006-12-17 (r6rs-discuss.mbox)

From: Thomas Lord <lord>
Date: Sun Dec 17 11:51:22 2006

  Tom> Noncharacter code points are explicitly described as suitable
  Tom> for internal use.

  John> So they are, and R5.91RS explicitly permits them.
  John> Noncharacter code points are not the same as surrogate code
  John> points, which are *not* explicitly described as suitable (and
  John> are not suitable) for internal use.

You miss the point.

Look back, please, at the language of the same conformance
requirements in version 3.0 of the spec:

   C4. [Don't treat unpaired surrogates as abstract characters]
   C5. [Don't treat U+FFFE or U+FFFF as an abstract character]
   C6. [Don't treat unassigned code values as abstract characters]

    * THESE CLAUSES DO NOT PRECLUDE THE ASSIGNMENT OF
       CERTAIN GENERIC SEMANTICS (FOR EXAMPLE, RENDERING
       WITH A GLYPH TO INDICATE THE CHARACTER BLOCK) THAT
       ALLOW FOR GRACEFUL BEHAVIOR IN THE PRESENCE OF CODE
       VALUES THAT ARE OUTSIDE A SUPPORTED SUBSET OR CODE
       VALUES _THAT_ARE_UNPAIRED_SURROGATES_.

    (emphasis added)

So what is your claim, here, John? That between 3.0 and 4.1
the consortium changed its mind about internal use of
surrogates *but forgot to tell anyone*?

Looking at the current language:

Conformance rules C4, C5, and C6 have the same sentence structure and differ
only in a single noun-phrase. They are all bundled together:

     C4. A process shall not interpret a high-surrogate code point or a
           low-surrogate code point as an abstract character.
     C5. A process shall not interpret a noncharacter code point as an
           abstract character.
     C6. A process shall not interpret an unassigned code point as an
           abstract character.

Evidently, these three classes of codepoints are, in the minds of
the Consortium, similar in an important way.
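To make the three classes concrete, here is a rough sketch in Python
that sorts a code point into the class named by C4, C5 or C6. The
function name and return labels are invented for illustration. The
surrogate and noncharacter ranges are fixed by the standard, but the
"unassigned" answer depends on which Unicode version the Python build
ships with, so treat that branch as approximate.

```python
import unicodedata

def classify(cp):
    """Name the conformance clause (if any) that covers code point cp."""
    # C4: high and low surrogates occupy U+D800..U+DFFF.
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate (C4)"
    # C5: the 66 noncharacters are U+FDD0..U+FDEF plus the last two
    # code points of every plane (U+xxFFFE and U+xxFFFF).
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter (C5)"
    # C6: general category Cn, once noncharacters are excluded, means
    # "unassigned" in the Unicode version known to this Python build.
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned (C6)"
    return "assigned"

for cp in (0x41, 0xD800, 0xFFFE, 0xFDD0):
    print(f"U+{cp:04X}: {classify(cp)}")
```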

What does the prohibition, to "not interpret a codepoint as an
abstract character", actually mean? Specifically, does it preclude
internal use? The commentaries on C5 and C6 plausibly suggest
otherwise:

     [on C5] The noncharacter code points may be used internally, such
             as for sentinel values or delimiters, but should not be
             exchanged publicly.

     [on C6] This clause does not preclude the assignment of certain
              generic semantics to unassigned code points (for
              example, rendering with a glyph to indicate the position
              within a character block) that allow for graceful
              behavior in the presence of code points that are outside
              a supported subset.

Those expository notes make it very clear that the rule
to "not interpret as an abstract character" does not preclude
internal use. For two out of three classes of the code points
in question, the Consortium even gave specific examples of
internal use.

It's a *little* odd that the Consortium is mentioning "internal use"
at all since that is outside of their charter. What can one say,
though, other than that they mention internal use only to emphasize
that rules like C4..C6 don't apply to it?

My claim is much simpler. In 3.0, rule C5 mentioned only
two noncharacter codepoints. By 4.1, it was necessary to
change the rule to cover all noncharacter codepoints. As
that change was made, it seems, the commentary about internal
use was word-smithed to make the answers to the most common
questions easier to spot. In the process, it became less
explicit that surrogates, too, can be used internally -- but
that basic fact remains true.

John> Specifically, allowing the representation of surrogate code
John> points means that UTF-16 cannot be used as an internal
John> representation at all (it cannot distinguish between two
John> consecutive surrogate code points and a non-BMP character) and
John> means that UTF-8 and UTF-32 cannot be used directly either, but
John> only in the form of non-standard variants.

I think you are really missing the spirit of writing a
programming language specification.

*Allowing* unpaired surrogates does not *require* that
unpaired surrogates be supported. A portable Scheme
program, under my proposal, cannot count on being able to
use unpaired surrogates.

Therefore, under my proposal, if you feel you simply can't
handle unpaired surrogates then fine: don't. Your implementation
is still conforming.

Meanwhile, if I feel I can handle unpaired surrogates,
I will. Under my proposal, my implementation can also be
conforming (under the current language, it cannot).
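For readers who want to see the UTF-16 ambiguity John describes, here
is a small Python sketch. Python strings happen to permit lone
surrogate code points, which makes them a convenient stand-in for an
implementation that chooses to support them. The helper names are
made up for this example.

```python
pair = "\ud800\udc00"    # two surrogate code points, side by side
astral = "\U00010000"    # one non-BMP code point

def utf16_units(text):
    """UTF-16 (big-endian) bytes, letting surrogate code points through."""
    return text.encode("utf-16-be", "surrogatepass")

def strict_utf8_accepts(text):
    """True if the standard (strict) UTF-8 encoder accepts the text."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# Different code point sequences, identical UTF-16 code units.
print(len(pair), len(astral))
print(utf16_units(pair) == utf16_units(astral))

# Standard UTF-8 refuses a lone surrogate; a variant encoding is needed.
print(strict_utf8_accepts("\ud800"), strict_utf8_accepts(astral))
```

An implementation that declines to support unpaired surrogates never
hits either case; one that supports them has to pick an internal
representation other than plain UTF-16.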

Tom> For every natural number (integers greater than or equal to 0)
Tom> there exists a distinct CHAR value. The set of all such values
Tom> are called "simple characters".

John> Whatever for?

So that the abstract model of character values is mathematically
simple and so that it is a good model for communications generally.

It keeps the model mathematically simple by not introducing
an arbitrary constant (e.g., maximum-char-ordinal). Implementations
can impose a limit of their own, of course.

It keeps the communications model intact. An N-bit wide port, in
this model, conveys characters 0..2^N-1. With no a priori upper bound
on port width (again, implementations can have limits) there can be no
a priori upper bound on the size of CHAR.
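The port argument is just arithmetic, and can be sketched in Python,
whose integers are unbounded. Both function names are hypothetical.
The point is only that every natural number fits on some port of
finite width, so bounding CHAR means bounding port width.

```python
def max_ordinal(port_width_bits):
    """Largest character ordinal an N-bit port conveys: 2^N - 1."""
    return (1 << port_width_bits) - 1

def min_port_width(ordinal):
    """Narrowest port (in bits) that can convey the given ordinal."""
    if ordinal < 0:
        raise ValueError("character ordinals are natural numbers")
    # Ordinal 0 still needs a one-bit port.
    return max(1, ordinal.bit_length())

print(max_ordinal(8))             # an 8-bit port conveys 0..255
print(min_port_width(0x10FFFF))   # width needed for the top Unicode code point
print(min_port_width(2**64))      # nothing special happens past 64 bits
```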

John> There does not exist a countable infinity of
John> simple characters to represent, Galactic Empire or no.

There *does* exist a countable infinity of symbols that can be
passed over a communications channel. Text is an important
use for CHAR values but not the only use.

John> The number is *always* going to be finite, by the nature of
John> graphical representations: if there were a countable infinity of
John> characters, there would be for each character infinitely many
John> that are essentially indistinguishable from it, since each
John> character can be represented as a pixel grid of finite size.

CHAR is not only for naming glyphs.

John> I omit the rest, since it depends on this original and useless
John> notion.

You shorted yourself, then, by not getting to the topic of combining
sequence characters. Remember that, in addition to a simple character
for every integer, I'm also suggesting that all tuples of simple
characters are themselves characters. That idea isn't entirely
original: Ray Dillinger (Bear) first suggested something similar on a
SRFI list.

A Unicode text can be "framed" in different ways:

        1. divide it up into encoding values
        2. divide it up into codepoints
        3. divide it up into combining character sequences

Low-level Unicode APIs (e.g., libc) tend to give programs
a foundation of (1), leaving (2) and (3) up to library
routines and awkward code.

High-level languages, these days, tend to have APIs
at level (2), with level (3) in libraries, and level
(1) buried in the guts of a run-time system.

Bear wondered about the possibility of a high-level language
with a basic API at level 3. E.g., STRING-REF would
return the Nth combining character sequence from a
string, not the Nth codepoint (or code unit).
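The three framings, and a level-3 STRING-REF, can be sketched in
Python as follows. This is an illustration under simplifying
assumptions: level 1 is taken to be UTF-8 bytes, and a combining
character sequence is approximated as a code point followed by any
code points with a nonzero canonical combining class. That is cruder
than the segmentation rules in the Unicode annexes. All function names
are hypothetical.

```python
import unicodedata

def code_units(text):
    """Level 1: encoding values (here, UTF-8 bytes)."""
    return list(text.encode("utf-8"))

def code_points(text):
    """Level 2: code points."""
    return [ord(ch) for ch in text]

def combining_sequences(text):
    """Level 3: each base code point with its trailing combining marks."""
    sequences = []
    for ch in text:
        if sequences and unicodedata.combining(ch) != 0:
            sequences[-1] += ch      # attach the mark to the previous base
        else:
            sequences.append(ch)     # start a new sequence
    return sequences

def string_ref(text, n):
    """Level-3 STRING-REF: the Nth combining character sequence."""
    return combining_sequences(text)[n]

sample = "e\u0301a"   # 'e', COMBINING ACUTE ACCENT, 'a'
print(len(code_units(sample)))           # 4 code units
print(len(code_points(sample)))          # 3 code points
print(len(combining_sequences(sample)))  # 2 combining character sequences
print(string_ref(sample, 1))             # a
```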

I realized that Ray's idea actually works out very cleanly
and effectively when expressed in terms of my CHAR and STRING
model.

Combining character sequences are, in theory, of potentially
unbounded length. To read and write such sequences atomically
on a port implies that the port is of unbounded width.

At first I thought that that meant a combining character
sequence should be represented as a "simple character" of
very high ordinality. For example, the sequence of codepoints
X Y Z might be represented as:

           (Z << 64) | (Y << 32) | X

but that seemed wrong because, in effect, it places an
a priori upper bound on the number of simple characters.
It has those two arbitrary constants (32 and 64) in it.
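A quick Python sketch of that packing scheme shows where the bound
bites. The 32-bit slot width is the arbitrary constant in question,
and the names are invented for this example. Once a simple character
needs more than 32 bits, it spills into the next slot and becomes
indistinguishable from a different sequence.

```python
SLOT_BITS = 32   # the arbitrary constant baked into the scheme

def pack(codepoints):
    """Pack X Y Z ... as X | (Y << 32) | (Z << 64) | ..."""
    packed = 0
    for index, cp in enumerate(codepoints):
        packed |= cp << (index * SLOT_BITS)
    return packed

# Works as intended while every code point fits in one slot.
print(hex(pack([0x65, 0x301])))

# A simple character of ordinal 2^32 collides with the sequence (0, 1).
print(pack([2**32]) == pack([0, 1]))
```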

Then I thought about it in terms of combining communication channels.
Conveying a combining character sequence is, in effect, the act of
atomically transmitting its constituent codepoints in parallel, on an
ordered list of channels. Transmitting X Y Z, for example, is
logically the same thing as transmitting X on codepoint channel 0, Y
on codepoint channel 1, and Z on codepoint channel 2 all at the same
time.

There is a natural way to model an ordered list of datums for
parallel transmission: as a tuple. Tuples are a natural model for
the aggregate datums conveyed on an ordered list of simple channels.
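As a rough sketch of the tuple model in Python (the predicate names
are hypothetical, and Python integers stand in for simple characters
of unbounded ordinal): a CHAR is either a natural number or a tuple of
natural numbers, and no slot width or maximum ordinal appears
anywhere.

```python
def is_simple_char(value):
    """A simple character is any natural number."""
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0

def is_char(value):
    """A CHAR is a simple character or a tuple of simple characters."""
    if is_simple_char(value):
        return True
    return isinstance(value, tuple) and all(is_simple_char(v) for v in value)

# The sequence X Y Z is one character: channel 0 carries X, channel 1
# carries Y, channel 2 carries Z.
e_acute = (0x65, 0x301)
print(is_char(e_acute))

# Unlike a packed integer, a tuple keeps large ordinals and longer
# sequences distinct.
print((2**32,) != (0, 1))
```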

-t
Received on Sun Dec 17 2006 - 11:54:14 UTC
