[r6rs-discuss] on rationale 9.6 Characters and Strings from Thomas Lord on 2007-06-27 (r6rs-discuss.mbox)

From: Thomas Lord <lord>
Date: Tue, 26 Jun 2007 18:13:32 -0700

    */9.6 Characters and Strings/*

    /
    Where R^5 RS specified characters and strings in terms of its own,
    limited character set, R^6 RS specifies characters and strings in
    terms of Unicode. The primary goal of the design change were to
    improve the portability of Scheme programs that manipulate text,
    while preserving a maximum of backward compatibility with R^5 RS.
    /

That part of the rationale displays a significant misunderstanding of
R^5 RS and, in so doing, sets the stage for the errors of the design for
which this rationale is offered.

R^5 RS does not specify "its own, limited character set". Rather, R^5
RS specifies a /minimal structure/ which all implementations of the
char? type must have (and, over which certain properties of strings are
inductively defined). In contrast, the new draft specifies an /exact
structure/ for the char? type -- it describes a particular Finite Set.
For example, R^5 RS enumerates a list of abstract characters which must
be present in all implementations. The draft, on the other hand,
enumerates a list of abstract characters which must be present and which
are the only characters that /may/ be present in an implementation.
This difference is a large change between R^5 RS and the draft, and so
deserves its own rationale (which, I argue, is not forthcoming).

We should entertain as an a priori hypothesis that R^5 RS fails to
achieve a desired level of portability for Scheme programs that
manipulate text mostly because it specifies too little mandatory
structure for the char? type. The draft overreaches in not only
filling in that additional structure, but then going on to mandate that
no additional structure is permitted. (R^5 RS also gets wrong some
minor technical details, such as the invariants associated with case
conversions. Those details are easy to fix and are not important for
this response to the rationale document.) The rationale document, and
the design in the draft, fail to entertain this hypothesis: that the
char? type should remain "open ended".

    /R^6 RS defines characters to be representations of Unicode scalar
    values, and strings to be indexed sequences of characters. This is a
    different representation for Unicode text than the representations
    chosen by some other programming languages such as Java or C#, which
    use UTF-16 code units as the basis for the type of characters./

    /The representation of Unicode text corresponds to the lowest
    semantic level of the Unicode standard: The Unicode standard
    specifies most semantic properties in terms of Unicode scalar
    values. Thus, Unicode strings in Scheme allow the straightforward
    implementation of semantically sensitive algorithms on strings in
    terms of these scalar values.
    /

While it is certainly desirable to specify a Scheme API which models
Unicode Scalar values, no rationale is offered here to make this the
definition of the char? type. Certainly a new scalar-value? type, a
sub-type of char?, could achieve the same pragmatic aims without doing
such violence to the historic concept of char? as an open-ended class of
possible types.
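
As a rough illustration of that alternative (a sketch only, in Python
rather than Scheme, and not part of either report): the class `Char` and
the function `is_scalar_value` below are hypothetical names, and
modeling a character as a bare integer code is an assumption made for
brevity. The point is only that a scalar-value predicate can carve a
sub-type out of a character type that otherwise stays open-ended.

```python
class Char:
    """Hypothetical open-ended character: any non-negative integer code."""

    def __init__(self, code):
        if code < 0:
            raise ValueError("character code must be non-negative")
        self.code = code

    def __eq__(self, other):
        return isinstance(other, Char) and self.code == other.code

    def __lt__(self, other):
        return self.code < other.code

    def __hash__(self):
        return hash(self.code)


def is_scalar_value(ch):
    """True if ch is a Char whose code is a Unicode scalar value.

    Scalar values are 0..0x10FFFF with the surrogate range
    0xD800..0xDFFF excluded.
    """
    return isinstance(ch, Char) and (
        0 <= ch.code <= 0xD7FF or 0xE000 <= ch.code <= 0x10FFFF
    )
```

Under these assumptions `Char(0xD800)` and `Char(0x110000)` are still
characters, but neither satisfies `is_scalar_value`.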

    /In contrast, UTF-16 is a specific encoding for Unicode text, and
    performing semantic manipulation on UTF-16 representations of text
    is awkward. Choosing UTF-16 as the basis for the string
    representation would have meant that a character object potentially
    carries no semantic information at all, as surrogates have to be
    combined pairwise to yield the corresponding Unicode scalar value.
    (As a result, the APIs of Java for semantic operations on Unicode
    text often come in two overloadings, one for character objects and
    one for integers that are Unicode scalar values.)
    /

This portion of the rationale is simply confused. The phrase "carries
no semantic information at all" is particularly inexplicable because,
of course, sequences of UTF-16 code values have perfectly well-defined
semantics!
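
To make that concrete, here is a minimal Python sketch of the standard
mapping from UTF-16 code units to scalar values. The function name
`decode_utf16_units` is hypothetical, and treating an unpaired surrogate
as an error is one possible policy chosen for this sketch, not the only
one.

```python
def decode_utf16_units(units):
    """Map a sequence of 16-bit code units to a list of scalar values."""
    scalars = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate at index {i}")
            low = units[i + 1]
            # Combine the two 10-bit payloads and add the 0x10000 offset.
            scalars.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            # Any other code unit is itself a scalar value.
            scalars.append(u)
            i += 1
    return scalars
```

For example, the code units 0x0041, 0xD834, 0xDD1E decode to the two
scalar values 0x41 and 0x1D11E.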

We suspect that what the author of the rationale intended here was to
assert that sequences of Unicode scalar values are the natural
representation of all texts. That implicit assertion is false:

Not all texts are expressed over the writing systems of human alphabets,
and so, not all texts are the subject of Unicode standardization efforts.

Not all texts are expressed over a taxonomy of writing systems which has
been recognized by the Unicode consortium and, indeed, some texts are
understood to be in writing systems that the Unicode consortium has
explicitly declined to encode.

Many texts of interest in computing cannot reasonably be said to be in
any single writing system. Rather, what makes these texts interesting
is precisely that they are simultaneously in several different writing
systems. Thus, for example, a Unicode text may simultaneously be
written as a sequence of grapheme clusters, a sequence of scalar values,
a sequence of well-formed UTF-8 encoding forms, and a sequence of UTF-8
encoding values.
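
A small Python sketch of those simultaneous views of one text follows.
The helper `rough_clusters` is a hypothetical name and only a crude
stand-in for real grapheme cluster segmentation: it merely attaches
combining marks to the preceding character and does not implement the
Unicode segmentation rules.

```python
import unicodedata

# 'e' followed by a combining acute accent, then U+1D11E.
text = "e\u0301\U0001D11E"

# View 1: a sequence of scalar values.
scalar_values = [ord(c) for c in text]

# View 2: a sequence of UTF-8 code units (bytes).
utf8_units = list(text.encode("utf-8"))

# View 3: a sequence of UTF-16 code units.
utf16_bytes = text.encode("utf-16-be")
utf16_units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]


def rough_clusters(s):
    """Crude approximation of grapheme clusters (assumption: combining
    marks simply join the preceding character)."""
    clusters = []
    for c in s:
        if clusters and unicodedata.combining(c) != 0:
            clusters[-1] += c
        else:
            clusters.append(c)
    return clusters


# View 4: approximate grapheme clusters.
clusters = rough_clusters(text)
```

The same text has a different length in each view: 3 scalar values,
7 UTF-8 code units, 4 UTF-16 code units, and 2 approximate clusters.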

In spite of all the "hair" in real-world texts, nevertheless, a wide
variety of useful algorithms can be pragmatically expressed in terms of
generic operations on an underspecified char? type and the type of
finite sequences of char?. Such would seem to be the traditional
purpose of the character and string types in Scheme.

    /The surrogates cover a numerical range deliberately omitted from
    the set of Unicode scalar values. Hence, surrogates have no
    representation as characters---they are merely an artefact of the
    design of UTF-16. Including surrogates in the set of characters
    introduces complications similar to the complications of using
    UTF-16 directly. In particular, most Unicode consortium standards
    and recommendations explicitly prohibit unpaired surrogates,
    including the UTF-8 encoding, the UTF-16 encoding, the UTF-32
    encoding, and recommendations for implementing the ANSI C wchar_t
    type. Even UCS-4, which originally permitted a larger range of
    values that includes the surrogate range, has been redefined to
    match UTF-32 exactly. That is, the original UCS-4 range was shrunk
    and surrogates were excluded.
    /

All that is explained by that rationale is that, in order to define
certain normative procedures, the Unicode consortium found it handy to
define the set of scalar values (which includes no surrogates).

Many conceivable textual algorithms operate indifferently to whether
characters are understood to be encoding values, scalar values, or
something else entirely -- the algorithms rely only on basic equivalence
and ordering relations. Traditionally, Scheme's character and string
types are generic enough to express such algorithms. This property
will be lost if Scheme's characters are identified with Unicode scalar
values.
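
As a sketch of such an algorithm (in Python, with the hypothetical names
`common_prefix_length` and `compare_sequences`), the functions below use
nothing but equality and ordering on the elements. They make no
assumption about whether those elements are scalar values, encoding
values, or something else.

```python
def common_prefix_length(a, b):
    """Length of the longest common prefix, using only == on elements."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def compare_sequences(a, b):
    """Lexicographic three-way comparison using only == and < on elements.

    Returns -1, 0, or 1.
    """
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # All shared positions are equal, so the shorter sequence sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))
```

The same two functions apply unchanged to a list of scalar values, a
byte string of UTF-8 code units, or a list of UTF-16 code units.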

    /Arguably, a higher-level model for text could be used as the basis
    for Scheme's character and string types, such as grapheme clusters.
    However, no design satisfying the goals stated above was available
    when the report was written.
    /

That, in and of itself, suggests that the character and string types
should not be limited as they are in the draft. It is arguable, for
example, that Scheme would benefit from a mixed-level model for text,
which included encoding values, surrogates, scalar values, and grapheme
clusters, and perhaps more. That all of these possibilities which
contradict the draft are arguable is further confirmation that the draft
has overreached.

-t

Received on Tue Jun 26 2007 - 21:13:32 UTC
