Jason Orendorff wrote:
> If R6RS exposes code units, it should either
> standardize on a representation everyone can live with; or
> set the code unit API aside in a separate library, maybe
> (r6rs string-code-units), so people won't naively trip over it.
Not everything has to go "in R6" -- SRFIs are better suited
for new features that are more speculative.
Exposing code units (or not) in an API has nothing
to do with how strings are represented internally.
A perfectly reasonable environment would support (roughly
as others have suggested in these threads):
(define s <some-string-yielding-expression>)
(string-ref s n) => <a codepoint, normally>
(string-ref-utf8 s n) => <a utf8 encoding unit>
(string-ref-utf16 s n) => <a utf16 encoding unit>
(string-ref-grapheme s n) => <a grapheme cluster>
etc.
all working on a single string.
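For concreteness, here is one naive way such a UTF16 view could
be derived from an ordinary codepoint string in plain R5 Scheme.
The names are mine, not anything proposed; it assumes CHAR->INTEGER
yields Unicode scalar values (which R5 does not promise); and it
hands back the units as integers, even though whether they *should*
be integers is exactly the question taken up below.

(define (codepoint->utf16-units cp)
  ;; One codepoint becomes one or two UTF16 code units.
  (if (< cp #x10000)
      (list cp)                                  ; a single BMP unit
      (let ((v (- cp #x10000)))
        (list (+ #xD800 (quotient v #x400))      ; high surrogate
              (+ #xDC00 (remainder v #x400)))))) ; low surrogate

(define (string->utf16-units s)
  ;; The whole string's UTF16 code units, as a flat list.
  (apply append
         (map (lambda (c) (codepoint->utf16-units (char->integer c)))
              (string->list s))))

(define (string-ref-utf16 s n)
  ;; One naive reading of STRING-REF-UTF16: index into that view.
  (list-ref (string->utf16-units s) n))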
The "representational tower" is still there, and is still a valid
and useful programming construct, regardless of how the performance
of such procedures varies among implementations. In some
sense, it is an empirical question, yet to be answered, which
internal representations will have the best survival characteristics
in the context of yet-to-be-written portable Scheme code. Meanwhile,
in the many, many cases where string performance is not critical,
portable programs can usefully operate at whatever point on the
representation tower makes the most semantic sense for the
application at hand.
One of the big issues, to my mind, is what constraints R6 will
place on the fundamental types of the domains and ranges
of procedures such as those in the list above. Another is over
which domains such procedures should be required to be total.
Some argue that, for example, a (hypothetical) procedure such
as STRING-REF-UTF16 ought to return an integer. I regard
this as a misplacement of some otherwise desirable disjointness
properties and as a violation of an abstraction barrier:
The Scheme tradition, distinct from the broader lisp tradition,
draws a reasonable distinction between character-like values
and integers, and provides string-like sequences of the former. It is an
error, in R5, to try to store an integer in a string without
explicitly signaling that a conversion is intended. Integers are
reserved for specifically numeric computation -- character-like
values for communications. Not all lisps draw this line. In Emacs
lisp, there is no distinct character-like type at run time (it exists
only in the lexical syntax); Emacs lisp "characters" *are*
integers. But Scheme is committed to making the distinction,
and UTF16 code units are clearly more character-like
than number-like. So treating UTF16 values in Scheme
as integers violates an abstraction barrier.
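To recall the R5 situation in miniature (the conversion must be
spelled out):

(define t (make-string 1))
;; (string-set! t 0 65)               ; an error: 65 is an integer, not a char
(string-set! t 0 (integer->char 65))  ; fine: the conversion is explicit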
That violation of an abstraction barrier is apparently motivated
(in 5.92) by a misplacement of an otherwise desirable disjointness
property. If we accept that, indeed, the conceptual class of
character-like values (and their string-like forms) includes such
things as UTF16 values, that suggests that they should be (or at
least should be *permitted* to be) CHAR values. On the other hand,
it is clearly a (conceptual) type error to try to pass an arbitrary
UTF16 value to, for example, an
algorithm intended to be expressed over Unicode code-points.
One would like to be able to write algorithms at each level
of the Unicode representation tower, but use dynamic type
checking to catch errors where a non-code-point value is used
in a code-point algorithm. If both UTF16 character-like
values and code-point values are present, arguably they should
be of disjoint types.
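A trivial sketch of what that dynamic check buys; CHAR? is standing
in here for whatever predicate ends up naming the code-point type,
and ERROR is as in SRFI 23:

(define (codepoint-upcase c)
  ;; A code-point algorithm that guards its domain at run time: a
  ;; stray UTF16 unit or grapheme is caught here rather than
  ;; silently misinterpreted.
  (if (char? c)
      (char-upcase c)
      (error "not a code point:" c)))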
Cowan's proposal (he was speaking of bucky-bit characters
but the same logic applies) is that "super types" of CHAR
should be permitted. While I hope not to put words in his
mouth, I think it is fair to say that his logic suggests an
implementation with:
(UTF16? x) implies (CHARLIKE? x)
(CHAR? x) implies (CHARLIKE? x)
(GRAPHEME? x) implies (CHARLIKE? x)
etc.
and one could then write textually insensitive string
processing algorithms over an imagined STRINGLIKE? type
rather than over STRING? per se.
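In code, the shape of that suggestion is roughly this (a hypothetical
sketch only; the UTF16? and GRAPHEME? stubs stand in for genuinely
disjoint types an implementation would supply):

;; Purely illustrative stubs: a real implementation would provide
;; these as disjoint types of its own.
(define (utf16? x) #f)
(define (grapheme? x) #f)

(define (charlike? x)
  ;; The imagined union: every level of the tower is CHARLIKE?.
  (or (char? x) (utf16? x) (grapheme? x)))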
In that sense, we're left arguing mostly over the names of
things and I'm on the side that says the proper name for
the imagined CHARLIKE? type is actually, gosh, CHAR?.
Consider the status of string algorithms coded against R5.
Some are textually sensitive and will need
revision before they can work correctly with Unicode
texts. Surely, for those, it is not an unreasonable burden
to replace CHAR? with CODEPOINT? (and so forth).
Other R5-based string code is textually insensitive and
makes perfect sense whether interpreted as operating on
encoding values, code-points, graphemes, traffic light
control signals, or any other character-like value.
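For example, something like this (hypothetical) counting procedure
is textually insensitive: parameterized over accessors, it reads the
same whether the elements are encoding units, code-points, or
graphemes.

(define (stringlike-count pred s ref len)
  ;; Count the elements of S, reached via REF and LEN, that satisfy
  ;; PRED, never caring what kind of character-like value they are.
  (let loop ((i 0) (n 0))
    (if (= i (len s))
        n
        (loop (+ i 1)
              (if (pred (ref s i)) (+ n 1) n)))))

;; With the ordinary R5 accessors it is just a count over a string:
;; (stringlike-count char-whitespace? "a b c" string-ref string-length) => 2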
The core definitions of Scheme have always aimed to be
minimal and as general as practical with regard to the
"fundamental [aka basic] types". I can't imagine (and
have yet to hear) any non-handwavy explanation of
CHAR-as-code-points-only. Rather, it seems to me that
the natural extension of R5 has CHAR-as-communications-token.
CHAR is the generic type (and STRING its sequence type)
for all character-like tokens. This isn't just a matter of
"taste": as I've point out in my formal comment (the
proposed section 2.2) this generic conception of CHAR is
nicely grounded in both our most fundamental theories of
discrete communications /and/ the important semantic role
of ports in interpreting the report.
-t
Received on Fri Mar 23 2007 - 14:12:23 UTC