[r6rs-discuss] unicode (re comment #134)

From: Thomas Lord <lord>
Date: Sat Dec 16 20:45:35 2006

R6RS should permit the domain of INTEGER->CHAR to include,
at least, any non-negative integer value.

This is a follow-up to formal comment #134.

The authors advise:

    We would be persuaded by a published recommendation from the
    Unicode consortium that seems to us to unambiguously support
    your suggestion, e.g., a recommendation of Unicode code
    points as a suitable definition for a "character" datatype.


The test of "plausibility" is a low bar and, in this case,
it can be satisfied with a very highly-placed statement
from the Consortium:

     The Unicode Standard, section 3.2, conformance requirements
     C4 and C5:


        C4. A process shall not interpret a high-surrogate code
            point or a low-surrogate code point as an abstract
            character.

          * The high-surrogate and low-surrogate code points are
            designated for surrogate code units in the UTF-16
            character encoding form. They are unassigned to any
            abstract character.

        C5. A process shall not interpret a noncharacter code
            point as an abstract character.

          * The noncharacter code points may be used internally,
            such as for sentinel values or delimiters, but
            should not be exchanged publicly.


Noncharacter code points are explicitly described as suitable
for internal use.

I think this handily satisfies the very reasonable criterion
of "plausible" that the authors offered and, I hope, is
sufficient.
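
To make the categories named in C4 and C5 concrete, here is a
minimal sketch in Python (chosen only for illustration; it is not
proposed text for R6RS). The function name classify_code_point is
made up for this example. The ranges are the surrogate and
noncharacter ranges defined by the Unicode Standard.

```python
def classify_code_point(cp: int) -> str:
    """Classify an integer by the categories that C4 and C5 refer to."""
    if not 0 <= cp <= 0x10FFFF:
        return "outside the Unicode codespace"
    if 0xD800 <= cp <= 0xDBFF:
        return "high-surrogate"
    if 0xDC00 <= cp <= 0xDFFF:
        return "low-surrogate"
    # Noncharacters: U+FDD0..U+FDEF, plus the last two code points
    # of every plane (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return "other"
```

Both conformance clauses restrict how a code point is interpreted;
neither says that a process may not hold such a value internally.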

Less formally:

The conformance requirements in the Unicode standard pertain to
exchange, and the Technical Reports simply elaborate on them.
The seriousness with which the R6RS authors regard the
Consortium is appropriate, but only when applied to questions
of data exchange.

By charter, intention, process, and outcome, you won't find many
official statements from the Consortium about programming
language design except as it pertains to the exchange of source
code. They have given careful and weighty consideration to the
question of admitting Klingon to the list of supported languages
but, in contrast, the questions before the R6RS committee have
almost entirely escaped the attention of the formal Unicode
Consortium process.

There *is*, I admit, lots of expertise, particularly expert
reporting on the accumulated experience of industry, that can be
had from informal communication with those close to the Unicode
Consortium. Still, such advice should not be *casually*
transliterated into normative clauses in R6RS, please.

Regards,
-t

p.s.:

* How To Design the Standard for CHAR, STRING, and PORT Types

  This is how I think I would do it:


* Claude Shannon (-esque)

  A communications channel, loosely speaking, is a
  stream of symbols chosen from some alphabet.


* Scheme

  Values of the type CHAR represent communication channel
  symbols. Values of type PORT represent streams of type
  CHAR -- communications channels. Values of type STRING
  are the natural algebraic tuples over the CHAR type.


* The Universal CHAR Type

  Well beyond R5RS lies the universal CHAR type that achieves
  the designation "universal" on the grounds that it is
  minimalist and makes a lot of sense. R6RS need not describe
  this type completely but it ought not preclude this type.

  In the universal CHAR type:

  For every natural number (every integer greater than or equal
  to 0) there exists a distinct CHAR value. These values are
  called "simple characters".

  For every finite list of simple characters, there exists a
  distinct character which is the "combining sequence" of those.
  Conceptually, a non-simple character represents the
  parallel transmission of a tuple of simple characters -- the
  combination of channels is a channel.

  Only simple characters can be reliably converted to and from
  integers by portable programs. All characters, however, can
  be converted to and from lists of simple characters.
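
  As a rough model of the type just described, here is a sketch in
  Python (not Scheme, and only an illustration). The names Simple,
  Combining, integer_to_char, char_to_integer, char_to_list, and
  list_to_char are all hypothetical. One assumption goes beyond the
  text above: a one-element list converts back to the simple
  character itself, so that list conversion round-trips.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass(frozen=True)
class Simple:
    """A simple character: exactly one per natural number."""
    code: int

    def __post_init__(self) -> None:
        if self.code < 0:
            raise ValueError("simple characters are indexed by naturals")


@dataclass(frozen=True)
class Combining:
    """A combining sequence: a finite tuple of simple characters."""
    parts: Tuple[Simple, ...]


Char = Union[Simple, Combining]


def integer_to_char(n: int) -> Simple:
    return Simple(n)


def char_to_integer(c: Char) -> int:
    # Only simple characters convert to integers portably.
    if isinstance(c, Simple):
        return c.code
    raise TypeError("not a simple character")


def char_to_list(c: Char) -> List[Simple]:
    return [c] if isinstance(c, Simple) else list(c.parts)


def list_to_char(parts: Sequence[Simple]) -> Char:
    parts = tuple(parts)
    # Assumption: a singleton list denotes the simple character itself.
    return parts[0] if len(parts) == 1 else Combining(parts)
```

  In this model integer_to_char accepts any non-negative integer,
  including a surrogate code point such as #xD800.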
 

* PORT: Bitstreams to and from CHAR Streams

  A "naked" physical communications channel could be
  regarded, more or less, as a bitstream. There is more than
  one way to divide that bitstream up into symbols, in the
  Claude Shannon sense. Perhaps it is a stream of octets.
  Perhaps it is a stream of UTF-16 code units. Perhaps
  the bitstream is something else entirely.

  It is a matter of higher-level protocols how a port
  interprets its bitstream. In Scheme, that means it should
  be a matter of how the port is created and configured.

  Regardless, though, following our Shannon-esque model, PORTs
  always read and write CHAR values, by definition. We treat
  CHAR as the "universal alphabet". An N-bit-wide port might
  use only the first 2^N CHAR values, for example.
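
  The idea that a port's configuration decides how the bitstream is
  cut into symbols can be sketched as follows, in Python rather than
  Scheme. The function read_chars and its parameters are hypothetical,
  CHAR values are modelled as bare integers, and the sketch handles
  only widths that are a whole number of octets.

```python
from typing import Iterator


def read_chars(data: bytes, width_bits: int = 8,
               big_endian: bool = True) -> Iterator[int]:
    """Yield CHAR values (as integers) from data framed at width_bits."""
    if width_bits <= 0 or width_bits % 8 != 0:
        raise ValueError("this sketch only handles whole-octet widths")
    step = width_bits // 8
    if len(data) % step != 0:
        raise ValueError("input is not a whole number of symbols")
    order = "big" if big_endian else "little"
    for i in range(0, len(data), step):
        # Each symbol is an integer below 2**width_bits.
        yield int.from_bytes(data[i:i + step], order)
```

  The same four octets D8 00 00 41 read as four CHAR values on an
  8-bit port and as two on a 16-bit port. On the 16-bit port the first
  value is #xD800, which such a port can deliver only if INTEGER->CHAR
  accepts it.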


* STRING: The Naive Tuple Type Over CHAR

  Physical I/O devices, and their reflections in OS APIs,
  tend to read and write symbols over communications channels
  in chunks -- sequences of characters. Higher level
  protocols, also, often manipulate communication symbols
  in sequential chunks.

  A natural need arises, therefore, for an optionally mutable
  and generally algebraic type which contains all finite tuples
  over values of type CHAR -- all possible chunks that might be
  useful negotiating with physical I/O or with protocol
  requirements.

  That, then, is the string type. The problem to be solved is
  that, over these ports, I can read and write these CHAR
  values, and I need to manipulate chunks of those. The
  solution is the STRING type which is nothing more than a
  "chunk" (or tuple) of CHAR values.


* Standard Procedures and Unicode

  The Scheme standard needs to speak of a standard character set
  and encoding forms for conveying portable Scheme programs. It
  needs to specify important elements of program and data
  exchange such as lexical equivalence among Scheme symbol
  representations.

  Where the topics of lexical interest to the standard coincide
  with topics of interest for standard procedures, the committee
  should (obviously) decide otherwise arbitrary questions about
  those procedures with reference to the lexical syntax. For
  example, Unicode offers an arbitrary choice of algorithms for
  canonicalizing identifier names: R6RS should make that choice
  for the lexical syntax and then define the corresponding
  standard procedures to agree with that specification.
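
  A small illustration of "make the choice once, then have the
  procedures agree", written in Python. The choice of NFC here is an
  assumption for the sake of the example, not a claim about what R6RS
  does or should pick. canonical_identifier and symbols_equal are
  hypothetical names; unicodedata.normalize is the Python standard
  library's normalization function.

```python
import unicodedata

# Assumption for illustration only: the lexical syntax picks NFC.
IDENTIFIER_FORM = "NFC"


def canonical_identifier(name: str) -> str:
    """Canonicalize an identifier the way the reader would."""
    return unicodedata.normalize(IDENTIFIER_FORM, name)


def symbols_equal(a: str, b: str) -> bool:
    """Compare identifiers using the same form as the lexical syntax."""
    return canonical_identifier(a) == canonical_identifier(b)
```

  Under this choice, an identifier spelled with a precomposed e-acute
  and one spelled with "e" followed by U+0301 name the same symbol,
  both for the reader and for the corresponding procedure.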

  In general, R6RS should specify behavior of programs over
  the portable character set, leaving behavior in other
  conditions unspecified. That is sufficient to permit
  conforming Unicode processes to be written in Scheme
  and sufficient to preserve "portable metacircularity".

  Nothing in such a program suggests any reason to *require*
  that implementations may not convert the integer code point
  of an unpaired surrogate to a CHAR value. Indeed, no reason
  for such a requirement is *anywhere* in evidence beyond an
  anecdotal expression of anecdotal opinion by someone close
  to the Unicode Consortium.

  *Permitting* implementations to forbid conversion of unpaired
  surrogate code point integer scalar values to values of type
  CHAR would be an entirely reasonable property for R6RS to
  have. It would seem to satisfy all substantive Unicode expert
  opinions that have been put before the R6RS authors. It would
  certainly be the more conservative approach.
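
  The difference between requiring and permitting the restriction
  can be sketched like this, again in Python and with made-up names.
  make_integer_to_char stands in for an implementation's choice of
  policy, and CHAR values are modelled as bare integers.

```python
def make_integer_to_char(forbid_surrogates: bool):
    """Build an INTEGER->CHAR under one implementation's policy."""
    def integer_to_char(n: int) -> int:
        if n < 0:
            raise ValueError("negative integers are outside the domain")
        if forbid_surrogates and 0xD800 <= n <= 0xDFFF:
            raise ValueError("this implementation rejects surrogates")
        return n  # the CHAR value, modelled here as the integer itself
    return integer_to_char


strict = make_integer_to_char(forbid_surrogates=True)
lenient = make_integer_to_char(forbid_surrogates=False)
```

  Both policies agree on every integer outside the surrogate range,
  so a program that stays within the portable character set behaves
  the same on either. They differ only on input such as #xD800, which
  the standard would leave unspecified.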


-t
Received on Sat Dec 16 2006 - 20:48:25 UTC
