[r6rs-discuss] Re: [Formal] formal comment (ports, characters, strings, Unicode)
I am posting this as an individual member of the Scheme
community. I am not speaking for the R6RS editors, and
this message should not be confused with the editors'
eventual formal response.
MichaelL wrote:
> Or when the abstraction leaks, as string-ref does for UTF-8 and UTF-16.
I don't understand what you mean by saying "the abstraction
leaks" for string-ref and/or UTF-8 and UTF-16, particularly
since the draft R6RS does not tell implementations to use
UTF-8 or UTF-16 or not to use UTF-8 or UTF-16.
> Do
> you think that being able to write string-find portably & efficiently is
> important?
Yes. With the current draft R6RS, that can be done only if
implementors have enough brains to provide O(1) amortized
time for string-ref. Implementors can accomplish that by
any one of dozens of plausible strategies. The simplest
strategy is to use UTF-32, and the more complex strategies
use a mixture of representations, some of which may use
caching.
I don't intend to teach a seminar here on implementation
strategies for O(1) string-ref, but I'll describe just one
simple strategy that achieves O(1) time for both string-ref
and string-set! while using only a little more space than
UTF-8. The basic idea is to represent every string by an
opaque, sealed record whose fields include a vector of
bytevectors. Each of those bytevectors except the last is the
UTF-8 encoding of exactly 100 characters; the last one
contains between 0 and 100 characters, inclusive, and
contains 0 characters iff the length of the entire string
is 0.
Implementation of O(1) string-ref and string-set! for that
representation is left as an exercise for readers who
understand big-oh notation.
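One possible shape of that exercise, sketched in Python rather than
Scheme so it is short and easy to run. The class name ChunkedString and
the method names ref, set, and to_str are hypothetical, chosen only for
this illustration. The sketch assumes the representation described
above: a sequence of UTF-8 chunks, each holding at most 100 characters.
Because a chunk never exceeds 400 bytes, decoding or re-encoding a
single chunk is bounded by a constant, which is what makes both
operations O(1).

```python
CHUNK = 100  # characters per chunk, as in the description above


class ChunkedString:
    """Hypothetical string type: a list of UTF-8 chunks of at most CHUNK characters."""

    def __init__(self, text: str = "") -> None:
        self._length = len(text)
        # Every chunk except the last encodes exactly CHUNK characters.
        # The empty string is represented by a single empty chunk.
        self._chunks = [
            text[i:i + CHUNK].encode("utf-8")
            for i in range(0, len(text), CHUNK)
        ] or [b""]

    def __len__(self) -> int:
        return self._length

    def ref(self, k: int) -> str:
        """Analogue of string-ref: decode one bounded-size chunk, then index it."""
        if not 0 <= k < self._length:
            raise IndexError(k)
        chunk, offset = divmod(k, CHUNK)
        return self._chunks[chunk].decode("utf-8")[offset]

    def set(self, k: int, ch: str) -> None:
        """Analogue of string-set!: re-encode only the chunk that holds position k."""
        if not 0 <= k < self._length:
            raise IndexError(k)
        if len(ch) != 1:
            raise ValueError("expected a single character")
        chunk, offset = divmod(k, CHUNK)
        chars = list(self._chunks[chunk].decode("utf-8"))
        chars[offset] = ch
        # The chunk's byte length may change; no other chunk is touched.
        self._chunks[chunk] = "".join(chars).encode("utf-8")

    def to_str(self) -> str:
        """Reassemble the whole string (O(n); for inspection only)."""
        return b"".join(self._chunks).decode("utf-8")
```

Replacing a one-byte character with a four-byte character only grows
the one chunk that contains it, so the cost of set does not depend on
the length of the string.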
I don't expect any implementations to use a representation
as bad as the one I described above. That was just to show
that achieving O(1) time for string-ref and string-set! is
child's play compared to some of the other stuff mandated
by the current draft R6RS.
I do think most implementors have enough brains to provide
efficient O(1) amortized time string-ref, but I could be
wrong about that. Programmers who are paranoid about the
performance of string-ref can convert their strings to
bytevectors in whatever byte-level representation they
prefer, and hope that bytevector-u8-ref is O(1).
To make it easier to write representation-specific
algorithms in Scheme, someone could write a SRFI that
provides conversions between R6RS strings and bytevectors
that represent text using UTF-8, UTF-16, or UTF-32,
and provides an appropriate set of operations for each
of those bytevector representations. I don't think this
SRFI needs to be part of the R6RS, since a portable
reference implementation would solve the portability
problem. Folding that SRFI into the R6RS wouldn't
make it run any faster.
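As a rough illustration of what one corner of such a library might
offer, here is a sketch in Python of a conversion to a UTF-32 byte
sequence plus a few operations on that representation. The function
names string_to_utf32, utf32_length, utf32_ref, and utf32_find are
hypothetical and do not come from any SRFI or from the draft R6RS.
Big-endian UTF-32 without a byte-order mark is an arbitrary choice made
for the sketch.

```python
import struct


def string_to_utf32(text: str) -> bytes:
    """Encode text as big-endian UTF-32 with no byte-order mark."""
    return text.encode("utf-32-be")


def utf32_length(bv: bytes) -> int:
    """Number of characters: every character occupies exactly 4 bytes."""
    return len(bv) // 4


def utf32_ref(bv: bytes, k: int) -> str:
    """Character at index k, read directly from byte offset 4*k."""
    if not 0 <= k < utf32_length(bv):
        raise IndexError(k)
    (code_point,) = struct.unpack_from(">I", bv, 4 * k)
    return chr(code_point)


def utf32_find(bv: bytes, ch: str, start: int = 0) -> int:
    """Index of the first occurrence of ch at or after start, or -1 if absent."""
    target = ord(ch)
    for k in range(start, utf32_length(bv)):
        if struct.unpack_from(">I", bv, 4 * k)[0] == target:
            return k
    return -1
```

With a fixed-width encoding the character index and the byte offset
differ only by a factor of 4, so a search written this way is linear in
the length of the text no matter what string-ref costs.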
> The one other consideration is the use of external libraries. Unicode is a
> very big standard, and parts of it (like collation) are very complicated.
> You really do not want to be writing your own implementation of the
> Unicode Collation Algorithm.
>
> Windows and Mac are both UTF-16. Java and .NET are both UTF-16. IBM's
> ICU--an excellent open source, cross-platform, cross-language [C, C++,
> Java] internationalization library--is UTF-16 (with increasing UTF-8
> support). Linux (and, I believe, Solaris) are UCS-4.
Reading on:
> You left out one popular encoding, UCS-2.
And on:
> On Linux, for example, UTF-8 is increasingly the default system
> encoding--but Linux's wide-chars are UCS-4. Many of libc's string
> operations--eg, strcoll--will work directly on UTF-8 strings; others first
> require conversion to UCS-4.
And on:
> These days UTF-8 is the overwhelming favorite for transmitting and storing
> text, and is the assumed default of almost any new standard.
Summarizing: No single encoding is going to solve the
problem of interfacing with external libraries (which,
by the way, is a problem the draft R6RS does not even
attempt to address).
Conclusion: The R6RS should not mandate any particular
encoding or representation of strings.
The current draft doesn't.
Will
Received on Tue Mar 20 2007 - 00:08:16 UTC