[r6rs-discuss] Re: [Formal] formal comment (ports, characters, strings, Unicode) from William D Clinger on 2007-03-20 (r6rs-discuss.mbox)

From: William D Clinger <will>
Date: Tue Mar 20 00:08:21 2007

I am posting this as an individual member of the Scheme
community. I am not speaking for the R6RS editors, and
this message should not be confused with the editors'
eventual formal response.

MichaelL wrote:

> Or when the abstraction leaks, as string-ref does for UTF-8 and UTF-16.

I don't understand what you mean by saying "the abstraction
leaks" for string-ref and/or UTF-8 and UTF-16, particularly
since the draft R6RS does not tell implementations to use
UTF-8 or UTF-16 or not to use UTF-8 or UTF-16.

> Do
> you think that being able to write string-find portably & efficiently is
> important?

Yes. With the current draft R6RS, that can be done only if
implementors have enough brains to provide O(1) amortized
time for string-ref. Implementors can accomplish that by
any one of dozens of plausible strategies. The simplest
strategy is to use UTF-32, and the more complex strategies
use a mixture of representations, some of which may use
caching.

I don't intend to teach a seminar here on implementation
strategies for O(1) string-ref, but I'll describe just one
simple strategy that achieves O(1) time for both string-ref
and string-set! while using only a little more space than
UTF-8. The basic idea is to represent every string by an
opaque, sealed record whose fields include a vector of
bytevectors. Each of those bytevectors except the last is
the UTF-8 encoding of exactly 100 characters; the last one
contains between 0 and 100 characters, inclusive, and
contains 0 characters iff the length of the entire string
is 0.

Implementation of O(1) string-ref and string-set! for that
representation is left as an exercise for readers who
understand big-oh notation.
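
For readers who would rather see one possible answer, here is a
minimal sketch of the representation described above. It is written
in Python rather than Scheme, and the names ChunkedString, ref, set,
and CHUNK are hypothetical, chosen only for this illustration. The
point is that every access decodes at most one chunk, and a chunk is
never longer than 400 bytes, so the work per access is bounded by a
constant.

```python
CHUNK = 100  # characters per chunk, as in the description above


class ChunkedString:
    """Hypothetical sketch: a string stored as a list of UTF-8 chunks."""

    def __init__(self, text=""):
        self._length = len(text)
        # Every chunk except the last encodes exactly CHUNK characters.
        # The last chunk is empty only when the whole string is empty.
        self._chunks = [
            bytearray(text[i:i + CHUNK].encode("utf-8"))
            for i in range(0, len(text), CHUNK)
        ] or [bytearray()]

    def __len__(self):
        return self._length

    def _locate(self, k):
        # Map a character index to (chunk number, offset within chunk).
        if not 0 <= k < self._length:
            raise IndexError("string index out of range")
        return divmod(k, CHUNK)

    def ref(self, k):
        chunk_index, offset = self._locate(k)
        # Decodes at most 4 * CHUNK bytes, whatever the string length.
        return self._chunks[chunk_index].decode("utf-8")[offset]

    def set(self, k, ch):
        if len(ch) != 1:
            raise ValueError("expected a single character")
        chunk_index, offset = self._locate(k)
        # Re-encode only the affected chunk; the others are untouched,
        # even if the new character has a different encoded width.
        chars = list(self._chunks[chunk_index].decode("utf-8"))
        chars[offset] = ch
        self._chunks[chunk_index] = bytearray("".join(chars).encode("utf-8"))
```

With this sketch, ChunkedString(text).ref(k) and .set(k, ch) each
touch one chunk only. The constant factor is large, which is the
price of keeping the code this short.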

I don't expect any implementations to use a representation
as bad as the one I described above. That was just to show
that achieving O(1) time for string-ref and string-set! is
child's play compared to some of the other stuff mandated
by the current draft R6RS.

I do think most implementors have enough brains to provide
efficient O(1) amortized time string-ref, but I could be
wrong about that. Programmers who are paranoid about the
performance of string-ref can convert their strings to
bytevectors in whatever byte-level representation they
prefer, and hope that bytevector-u8-ref is O(1).
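
As a rough illustration of that fallback, the sketch below (Python,
not Scheme; the helper names to_utf32 and utf32_ref are hypothetical)
converts a string to a fixed-width UTF-32 byte sequence, so that
character k always starts at byte 4*k and indexing needs no scanning.

```python
import struct


def to_utf32(text):
    # "utf-32-le" emits no byte-order mark, so character k starts
    # at byte offset 4 * k.
    return text.encode("utf-32-le")


def utf32_ref(data, k):
    # Read one little-endian 32-bit code point at a computed offset.
    if not 0 <= k < len(data) // 4:
        raise IndexError("character index out of range")
    (code_point,) = struct.unpack_from("<I", data, 4 * k)
    return chr(code_point)
```

The cost is space: four bytes per character regardless of content.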

To make it easier to write representation-specific
algorithms in Scheme, someone could write a SRFI that
provides conversions between R6RS strings and bytevectors
that represent text using UTF-8, UTF-16, or UTF-32,
and provides an appropriate set of operations for each
of those bytevector representations. I don't think this
SRFI needs to be part of the R6RS, since a portable
reference implementation would solve the portability
problem. Folding that SRFI into the R6RS wouldn't
make it run any faster.
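
To give a flavor of what one representation-specific operation might
look like, here is a sketch of string-find over UTF-8 bytes. It is in
Python rather than Scheme, and utf8_find is a hypothetical name, not
part of any existing SRFI. It relies on a property of UTF-8: a valid
encoded needle can only match a valid encoded haystack at a character
boundary, so a plain byte search gives the right position, and the
byte offset is then converted to a character index by counting the
bytes that are not continuation bytes.

```python
def utf8_find(haystack: bytes, needle: bytes) -> int:
    """Return the character index of needle in haystack, or -1.

    Both arguments are assumed to be valid UTF-8.
    """
    byte_pos = haystack.find(needle)
    if byte_pos < 0:
        return -1
    # Continuation bytes have the bit pattern 10xxxxxx; every other
    # byte begins a character, so counting those gives the index.
    return sum(1 for b in haystack[:byte_pos] if b & 0xC0 != 0x80)
```

The search itself never decodes anything; only the final conversion
from byte offset to character index walks the prefix.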

> The one other consideration is the use of external libraries. Unicode is a
> very big standard, and parts of it (like collation) are very complicated.
> You really do not want to be writing your own implementation of the
> Unicode Collation Algorithm.
>
> Windows and Mac are both UTF-16. Java and .NET are both UTF-16. IBM's
> ICU--an excellent open source, cross-platform, cross-language [C, C++,
> Java] internationalization library--is UTF-16 (with increasing UTF-8
> support). Linux (and, I believe, Solaris) are UCS-4.

Reading on:

> You left out one popular encoding, UCS-2.

And on:

> On Linux, for example, UTF-8 is increasingly the default system
> encoding--but Linux's wide-chars are UCS-4. Many of libc's string
> operations--eg, strcoll--will work directly on UTF-8 strings; others first
> require conversion to UCS-4.

And on:

> These days UTF-8 is the overwhelming favorite for transmitting and storing
> text, and is the assumed default of almost any new standard.

Summarizing: No single encoding is going to solve the
problem of interfacing with external libraries (which,
by the way, is a problem the draft R6RS does not even
attempt to address).

Conclusion: The R6RS should not mandate any particular
encoding or representation of strings.

The current draft doesn't.

Will
Received on Tue Mar 20 2007 - 00:08:16 UTC
