[r6rs-discuss] Unicode issues from William D Clinger on 2007-08-28 (r6rs-discuss.mbox)

From: William D Clinger <will>
Date: Tue, 28 Aug 2007 12:30:29 -0400

Most of these issues have already been discussed at
length in connection with SRFI 75 or the R6RS, so I'll
limit myself to brief responses to Daniel Ehrenberg's
six numbered issues.

> 1. In order to follow the suggestion in section 11.12 "Implementors
> should make string-ref run in constant time." only fixed-width
> encodings, like UTF-32 can be used.

The word "should" means this is only a recommendation,
so it cannot be as problematic as the many absolute
requirements of the R6RS that should not have been.

Many people have claimed that O(1) time is achievable
only with fixed-width encodings, but those claims are
false. What is true is that the specific variable-width
encodings that are described by the Unicode standard,
primarily for use in files, do not support O(1) access
and are therefore relatively unattractive as in-memory
representations, although issues of backward compatibility
combined with failures of imagination and foresight have
led to premature standardization on such representations
in several programming languages we both could name.

Implementations of the R6RS may use variable-width
representations while achieving O(1) amortized time for
string-length, string-ref, and string-set!.
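
As a rough illustration of how constant-time access can coexist with a
variable-width encoding, here is a minimal Python sketch. It is not
Larceny's representation, and the class name, method names, and STRIDE
constant are all hypothetical. It stores UTF-8 bytes plus the byte offset
of every STRIDE-th character, so a lookup scans at most STRIDE - 1
characters, a bound that does not grow with the length of the string.
Mutation (string-set!) is harder and is deliberately left out.

```python
class IndexedUtf8String:
    """Immutable string held as UTF-8 with a sparse offset index (hypothetical)."""

    STRIDE = 16  # characters between checkpoints; any fixed constant works

    def __init__(self, text: str) -> None:
        self._bytes = text.encode("utf-8")
        self._length = 0
        self._checkpoints = []  # byte offsets of characters 0, STRIDE, 2*STRIDE, ...
        for offset, byte in enumerate(self._bytes):
            if byte & 0xC0 != 0x80:  # not a continuation byte, so a character starts here
                if self._length % self.STRIDE == 0:
                    self._checkpoints.append(offset)
                self._length += 1

    def __len__(self) -> int:
        return self._length  # stored at construction, so O(1)

    @staticmethod
    def _width(lead: int) -> int:
        """Number of bytes in the UTF-8 sequence that starts with this lead byte."""
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def ref(self, k: int) -> str:
        """Return character k after skipping at most STRIDE - 1 characters."""
        if not 0 <= k < self._length:
            raise IndexError(k)
        offset = self._checkpoints[k // self.STRIDE]
        for _ in range(k % self.STRIDE):
            offset += self._width(self._bytes[offset])
        end = offset + self._width(self._bytes[offset])
        return self._bytes[offset:end].decode("utf-8")


text = "a\u00e9\u20ac\U0001F600" * 10  # mixes 1-, 2-, 3-, and 4-byte characters
s = IndexedUtf8String(text)
```

The index costs one offset per STRIDE characters, so a larger STRIDE trades
lookup time for space.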

On your blog, you stated that you couldn't figure out
which encoding Larceny uses [1]. That's because Larceny
is representation-independent: the current version of
Larceny can be built with either of two different
fixed-width encodings, and a future version will offer
at least one variable-width representation that still
provides O(1) access.

> 2. There are no explicit provisions for a limited repertoire of
> characters, where resources are limited. It looks like all characters
> must be supported.

Correct. That is one of the many absolute requirements
of the R6RS that should not have been.

> 3. The ordering of characters and strings is done by the Unicode
> scalar values. In the base library, eight comparison operators are
> included, and the Unicode library adds eight more. However, these are
> only useful in very limited circumstances and create the misleading
> impression that they're suitable for collation for humans. (see UAX 10
> for a more linguistically accurate collation)

I count ten in each, for a total of twenty. All are
generalizations of the corresponding comparison operators
of the R5RS. This generalization was done for backward
compatibility.

One can argue that backward compatibility should not
have been a goal of the R6RS. In my opinion, however,
the R6RS should have paid even more attention to backward
compatibility.

UTS 10, Unicode Collation Algorithm, is a Unicode Technical
Standard (UTS), not a standard annex (UAX) [2]. As UTS 10
itself states:

    A Unicode Technical Standard (UTS) is an independent
    specification. Conformance to the Unicode Standard
    does not imply conformance to any UTS.

In other words, the R6RS can conform to the Unicode standard
without conforming to UTS 10. The twenty comparisons that
are part of the R6RS standard libraries do not preclude other
libraries that implement UTS 10, but that was considered to
be beyond the scope of the R6RS standard libraries, and
properly so IMO.
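
To make the distinction concrete, here is a small Python sketch of
ordering by Unicode scalar value. Python's built-in string comparison
happens to use the same ordering, so it stands in for the R6RS
comparisons here; the word list and variable names are illustrative only.
The result is well defined and cheap to compute, but it is not what
UTS 10 collation would produce for human readers.

```python
words = ["zebra", "Apple", "\u00e9clair", "apple"]  # \u00e9 is precomposed e-acute

# Compare strings element by element on their scalar values.
by_scalar_value = sorted(words, key=lambda w: [ord(c) for c in w])

# 'A' (65) < 'a' (97) < 'z' (122) < e-acute (233), so the accented word
# lands after "zebra" rather than next to the other words starting with "e".
print(by_scalar_value)
```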

> 4. The eight case folded comparison functions (like string-ci<?)
> appear to be an ad-hoc attempt at something closer to appropriate
> collation, where "abc" comes before "AZZ". However, a better approach
> would be to use case as a tie breaker. A similar approach is needed
> for accent marks, if the output is to be consumed by humans. The
> inclusion of these comparisons is misleading and of very limited
> utility. (again, see UAX 10)

See my response to issue 3.

> 5. Case conversion in the Unicode library does not incorporate locale,
> so a third-party library will have to be used to provide correct
> behavior in the Turkish, Azeri and Lithuanian locales.

Correct. In an episode of sanity, the R6RS editors decided
that standardization of some approach to locales would be
premature. The first order of business should be to reach
consensus on the basic approach to Unicode in Scheme.
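
For readers unfamiliar with the locale problem, the following Python
sketch shows what a locale-independent mapping gets wrong for Turkish.
Both function names are hypothetical, and lower_turkish is a deliberate
simplification: the real rules in SpecialCasing.txt carry context
conditions that it ignores.

```python
def lower_default(text: str) -> str:
    """Locale-independent lowercasing, as in Python's str.lower."""
    return text.lower()


def lower_turkish(text: str) -> str:
    """Simplified Turkish/Azeri lowercasing (assumption: no context rules)."""
    # Capital I with dot above (U+0130) lowers to plain "i";
    # plain capital "I" lowers to dotless i (U+0131).
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


print(lower_default("ISTANBUL"))  # plain "i" at the start
print(lower_turkish("ISTANBUL"))  # dotless i (U+0131) at the start
```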

> 6. The functions char-upcase, char-downcase, char-titlecase and
> char-casefold are inappropriate and incomplete, since they do not
> incorporate information from SpecialCasing.txt in the Unicode
> Character Database. They are implementation details of the
> corresponding operations on strings, and should not be exposed in the
> standard library.

The first two (char-upcase and char-downcase) provide
backwards compatibility with R5RS. How they were
generalized for Unicode was less important than that
they be generalized. I agree that the inclusion of
char-titlecase and char-casefold was questionable,
and I wish that more of the questionable features of
the R6RS had been omitted.
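
The limitation Daniel describes can be seen in a short Python sketch.
Python's str.upper applies the full case mappings, including those from
SpecialCasing.txt, so one character may become several. The char_upcase
function below is a hypothetical stand-in for a character-to-character
procedure: it only approximates the simple case mappings by leaving a
character unchanged whenever its full mapping is not a single character.

```python
def char_upcase(ch: str) -> str:
    """Approximate char -> char upcase: never returns more than one character."""
    upper = ch.upper()
    return upper if len(upper) == 1 else ch


word = "stra\u00dfe"  # \u00df is LATIN SMALL LETTER SHARP S

string_level = word.upper()                         # full mapping, may grow
char_level = "".join(char_upcase(c) for c in word)  # same length as input

print(string_level, len(string_level))
print(char_level, len(char_level))
```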

Will

[1] http://useless-factor.blogspot.com/2007/08/r-597-rs-unicode-library-is-broken.html
[2] http://www.unicode.org/reports/tr10/
Received on Tue Aug 28 2007 - 12:30:29 UTC