[r6rs-discuss] unicode (re comment #134)
R6RS should permit the domain of INTEGER->CHAR to include,
at least, any non-negative integer value.
This is a follow-up to formal comment #134.
The authors advise:
We would be persuaded by a published recommendation from the
Unicode consortium that seems to us to unambiguously support
your suggestion, e.g., a recommendation of Unicode code
points as a suitable definition for a "character" datatype.
The test of "plausibility" is a low bar and, in this case,
it can be satisfied with a very highly-placed statement
from the Consortium:
section 3.2, conformance requirements C4 and C5:
C4. A process shall not interpret a high-surrogate code
point or a low-surrogate code point as an abstract
character.
* The high-surrogate and low-surrogate code points are
designated for surrogate code units in the UTF-16
character encoding form. They are unassigned to any
abstract character.
C5. A process shall not interpret a noncharacter code
point as an abstract character.
* The noncharacter code points may be used internally,
such as for sentinel values or delimiters, but
should not be exchanged publicly.
Noncharacter code points are explicitly described as suitable
for internal use.
I think this handily satisfies the very reasonable
criterion of "plausible" the authors offered and, I hope, is
sufficient.
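
To make the two classes of code point in C4 and C5 concrete, here is
a minimal sketch in Python. The ranges are the ones the Unicode
Standard defines (surrogates U+D800..U+DFFF; noncharacters
U+FDD0..U+FDEF plus the last two code points of each of the 17
planes). The function names are hypothetical, chosen only for
illustration.

```python
MAX_CODE_POINT = 0x10FFFF


def is_surrogate(cp: int) -> bool:
    """True for high- and low-surrogate code points (the subject of C4)."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for the noncharacter code points (the subject of C5)."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return 0 <= cp <= MAX_CODE_POINT and (cp & 0xFFFE) == 0xFFFE


def may_denote_abstract_character(cp: int) -> bool:
    """True if C4 and C5 do not forbid reading cp as an abstract character."""
    in_range = 0 <= cp <= MAX_CODE_POINT
    return in_range and not is_surrogate(cp) and not is_noncharacter(cp)
```

Note that these predicates only describe how a code point may be
*interpreted*. They say nothing about whether the integer may be
stored in, or passed around as, a value of some datatype.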
Less formally:
The conformance requirements in the Unicode standard pertain to
exchange and the Technical Reports simply elaborate these. The
seriousness of the regard the R6RS authors are showing for the
consortium is appropriate but only when applied to questions of
data exchange.
By charter, intention, process, and outcome you won't find many
official statements from the consortium about programming
language design except as it pertains to the exchange of source
code. They have given careful and weighty consideration to the
question of admitting Klingon to the list of supported languages
but, in contrast, the questions before the R6RS committee have
almost entirely escaped the attention of the formal Unicode
Consortium process.
There *is*, I admit, lots of expertise, particularly expert
reporting on the accumulated experience of industry, that can be
had from informal communication with those close to the Unicode
consortium. Still, such advice should not be *casually*
transliterated into normative clauses in R6RS, please.
Regards,
-t
p.s.:
* How To Design the Standard for CHAR, STRING, and PORT Types
This is how I think I would do it:
* Claude Shannon (-esque)
A communications channel, loosely speaking, is a
stream of symbols chosen from some alphabet.
* Scheme
Values of the type CHAR represent communication channel
symbols. Values of type PORT represent streams of type
CHAR -- communications channels. Values of type STRING
are the natural algebraic tuples over the CHAR type.
* The Universal CHAR Type
Well beyond R5RS lies the universal CHAR type that achieves
the designation "universal" on the grounds that it is
minimalist and makes a lot of sense. R6RS need not describe
this type completely but it ought not preclude this type.
In the universal CHAR type:
For every natural number (an integer greater than or equal to 0)
there exists a distinct CHAR value. The set of all such
values is called the "simple characters".
For every finite list of simple characters, there exists a
distinct character which is the "combining sequence" of those.
Conceptually, a non-simple character represents the
parallel transmission of a tuple of simple characters -- the
combination of channels is a channel.
Only simple characters can be reliably converted to and from
integers by portable programs. All characters, however, can
be converted to and from lists of simple characters.
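
The universal CHAR type described above can be sketched as a small
algebraic data type. This is only an illustration in Python, not a
proposal for Scheme syntax; the names Simple, Combining, and the four
conversion functions are hypothetical. It also makes one assumption
the text leaves open: a one-element list converts back to the simple
character itself, so that the list round trip holds for simple
characters.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Simple:
    """A simple character: one distinct value per natural number."""
    code: int

    def __post_init__(self) -> None:
        if not isinstance(self.code, int) or self.code < 0:
            raise ValueError("simple characters are indexed by natural numbers")


@dataclass(frozen=True)
class Combining:
    """A combining sequence: a finite tuple of simple characters."""
    parts: Tuple[Simple, ...]


Char = Union[Simple, Combining]


def integer_to_char(n: int) -> Simple:
    # Defined for every natural number, with no holes in the domain.
    return Simple(n)


def char_to_integer(c: Char) -> int:
    # Only simple characters convert reliably to integers.
    if isinstance(c, Simple):
        return c.code
    raise TypeError("only simple characters convert portably to integers")


def char_to_simple_list(c: Char) -> List[Simple]:
    return [c] if isinstance(c, Simple) else list(c.parts)


def simple_list_to_char(parts: List[Simple]) -> Char:
    # Assumption: a singleton list collapses to the simple character.
    if len(parts) == 1:
        return parts[0]
    return Combining(tuple(parts))
```

In this sketch integer_to_char(0xD800) is an ordinary simple
character, and simple_list_to_char([integer_to_char(0x61),
integer_to_char(0x301)]) is a single non-simple character.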
* PORT: Bitstreams to and from CHAR Streams
A "naked" physical communications channel could be
regarded, more or less, as a bitstream. There is more than
one way to divide that bitstream up into symbols, in the
Claude Shannon sense. Perhaps it is a stream of octets.
Perhaps it is a stream of UTF-16 encoding values. Perhaps
the bitstream is something else entirely.
It is a matter of higher-level protocols how a port
interprets its bitstream. In Scheme, that means it should
be a matter of how the port is created and configured.
Regardless, though, following our Shannon-esque model, PORTs
always read and write CHAR values, by definition. We treat
CHAR as the "universal alphabet". An N-bit-wide port might
use only the first 2^N CHAR values, for example.
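
As a rough sketch of "how the port is created and configured", here
are two ways to divide the same kind of byte sequence into symbols,
written in Python. CHAR values are modelled as plain non-negative
integers, and the names octet_port and utf16_unit_port are
hypothetical. The 16-bit port deliberately yields each UTF-16 code
unit as its own symbol, surrogates included, rather than pairing them.

```python
from typing import Iterator


def octet_port(data: bytes) -> Iterator[int]:
    """An 8-bit-wide port: each octet is one CHAR value in 0..2**8 - 1."""
    for octet in data:
        yield octet


def utf16_unit_port(data: bytes, little_endian: bool = True) -> Iterator[int]:
    """A 16-bit-wide port: each code unit is one CHAR value in 0..2**16 - 1."""
    if len(data) % 2 != 0:
        raise ValueError("a 16-bit-wide port needs an even number of octets")
    order = "little" if little_endian else "big"
    for i in range(0, len(data), 2):
        yield int.from_bytes(data[i:i + 2], order)
```

Both ports read CHAR values; they differ only in how the underlying
bits are chunked, which is fixed when the port is constructed.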
* STRING: The Naive Tuple Type Over CHAR
Physical I/O devices, and their reflections in OS APIs,
tend to read and write symbols over communications channels
in chunks -- sequences of characters. Higher level
protocols, also, often manipulate communication symbols
in sequential chunks.
A natural need arises, therefore, for an optionally mutable
and generally algebraic type which contains all finite tuples
over values of type CHAR -- all possible chunks that might be
useful negotiating with physical I/O or with protocol
requirements.
That, then, is the string type. The problem to be solved is
that, over these ports, I can read and write these CHAR
values, and I need to manipulate chunks of those. The
solution is the STRING type which is nothing more than a
"chunk" (or tuple) of CHAR values.
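
A minimal sketch of the "chunk" idea in Python, again modelling CHAR
values as plain non-negative integers and a port as an iterator over
them. The name read_chunk is hypothetical. The point is only that the
string imposes no constraint of its own: whatever CHAR values the
port delivers, the tuple can hold.

```python
from itertools import islice
from typing import Iterator, Tuple


def read_chunk(port: Iterator[int], n: int) -> Tuple[int, ...]:
    """Read up to n CHAR values from the port and return them as a string.

    The result is shorter than n only when the port runs out of symbols.
    """
    return tuple(islice(port, n))
```

An immutable tuple is used here for simplicity; a mutable variant
would just be a list built the same way.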
* Standard Procedures and Unicode
The Scheme standard needs to speak of a standard character set
and encoding forms for conveying portable Scheme programs. It
needs to specify important elements of program and data
exchange such as lexical equivalence among Scheme symbol
representations.
Where the topics of lexical interest to the standard coincide
with topics of interest for standard procedures, the committee
should (obviously) decide otherwise arbitrary questions about
those procedures with reference to the lexical syntax. For
example, Unicode offers an arbitrary choice of algorithms for
canonicalizing identifier names: R6RS should make that choice
for the lexical syntax and then define the corresponding
standard procedures to agree with that specification.
In general, R6RS should specify behavior of programs over
the portable character set, leaving behavior in other
conditions unspecified. That is sufficient to permit
conforming unicode processes to be written in Scheme
and sufficient to preserve "portable metacircularity".
Nothing in such a program suggests any reason to *require*
that implementations may not convert the integer code point
of an unpaired surrogate to a CHAR value. Indeed, no reason
for such a requirement is *anywhere* in evidence beyond an
anecdotal expression of anecdotal opinion by someone close
to the Unicode Consortium.
*Permitting* implementations to forbid conversion of unpaired
surrogate code point integer scalar values to values of type
CHAR would be an entirely reasonable property for R6RS to
have. It would seem to satisfy all substantive Unicode expert
opinions that have been put before the R6RS authors. It would
certainly be the more conservative approach.
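
The difference between *requiring* and *permitting* the restriction
can be sketched as an implementation-chosen policy. This is an
illustration in Python under stated assumptions: the factory
make_integer_to_char, its allow_surrogates flag, and the error class
are all hypothetical, and CHAR values are modelled as tagged tuples.

```python
class CharConversionError(ValueError):
    """Raised when this implementation declines to convert an integer."""


def make_integer_to_char(allow_surrogates: bool):
    """Build an INTEGER->CHAR procedure for one hypothetical implementation."""

    def integer_to_char(n: int):
        if not isinstance(n, int) or n < 0:
            raise CharConversionError("expected a non-negative integer")
        if not allow_surrogates and 0xD800 <= n <= 0xDFFF:
            # The implementation's choice, not something a standard forces.
            raise CharConversionError("surrogate code point refused")
        return ("char", n)

    return integer_to_char


strict_integer_to_char = make_integer_to_char(allow_surrogates=False)
liberal_integer_to_char = make_integer_to_char(allow_surrogates=True)
```

A portable program would confine itself to integers that both
procedures accept, and would get the same result from either.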
-t
Received on Sat Dec 16 2006 - 20:48:25 UTC