[r6rs-discuss] Re: [Formal] formal comment (ports, characters, strings, Unicode)

From: William D Clinger <will>
Date: Tue Mar 20 13:33:44 2007

I am posting this as an individual member of the Scheme
community. I am not speaking for the R6RS editors, and
this message should not be confused with the editors'
eventual formal response.

Per Bothner quoting me:

> > I do think most implementors have enough brains to provide
> > efficient O(1) amortized time string-ref,
>
> Well, there may be constraints that complicate that. For example
> a Java-based implementation may want to use java.lang.String
> for immutable strings. The implementation of java.lang.String
> is fixed and inaccessible. Java provides O(1) access for
> UTF-16 code units, but not Unicode scalar values. One can
> get O(1) by adding extra data - but then one is adding an extra
> object and thus extra space for each String, plus one loses some
> level of compatibility/interoperability with Java. One can restrict
> use of java.lang.String to strings in the Basic Multilingual Plane,
> and use some other representation for strings containing characters
> above 2^16. I.e. there are solutions, none of them great.

Agreed.

It doesn't sound so bad to me, though. You're going
to have to use something other than java.lang.String
for mutable strings anyway, and I presume your mutable
representation will handle full Unicode. You can use
java.lang.String only for immutable strings that don't
require surrogates, using your mutable representation
for immutable strings that would involve surrogates if
represented by a java.lang.String. The current draft
of the R6RS doesn't require implementations to enforce
immutability of strings (or maybe it does, but if so
that's an error in the draft), and I wouldn't expect
there to be a lot of immutable strings with really odd
characters in them anyway.
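
As a rough sketch of that choice, assuming strings are indexed by
scalar value, an implementation might test each immutable string once
when it is created. The names string-needs-surrogates? and
choose-representation, and the two symbols returned, are hypothetical
and appear in no draft:

```scheme
;; Hypothetical helper: true if some character of s lies outside the
;; Basic Multilingual Plane, so that a UTF-16 representation such as
;; java.lang.String would need a surrogate pair for it.
(define (string-needs-surrogates? s)
  (let loop ((i 0))
    (cond ((= i (string-length s)) #f)
          ((> (char->integer (string-ref s i)) #xFFFF) #t)
          (else (loop (+ i 1))))))

;; Hypothetical policy: pick a representation for an immutable string.
;; The symbols merely stand in for whatever the implementation uses.
(define (choose-representation s)
  (if (string-needs-surrogates? s)
      'scalar-value-array     ; same representation as mutable strings
      'host-utf16-string))    ; e.g. java.lang.String
```

The scan is linear in the length of the string, but it happens once
per string rather than once per string-ref.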

> However, it seems to prohibit an implementation that is simple
> (the way a raw array is), space-efficient, and O(1) for
> string-ref/set! Pick any two.

Similarly, Scheme prohibits an implementation of
procedures that is simple (the way raw function
pointers are), correct, and fast. Pick any two.

As Aziz said, I think you're mostly complaining
about the limitations of other programming languages
and libraries with respect to Unicode, and to some
extent about the design of Unicode itself. We can't
fix those problems. We can only work around them.
That it is painful to work around those problems
should come as no surprise, but the only alternative
is to wire those problems into Scheme.

For example...

> But such indexing retrieves code units, not scalar values.
>
> This works fine, since there is no real application where you need
> to index the N'th scalar value of a string.

Then why are we having this conversation?

We are having this conversation because there are *lots*
of applications that need to index either (1) the Nth
scalar value of a string or (2) the Nth code unit of
some particular representation of the string.

> If we add:
> (string-codepoint-ref str i)
> then we can achieve all three.

Here's an implementation of that in R5.92RS Scheme:

    (define (string-codepoint-ref str i)
      (char->integer (string-ref str i)))

What you were trying to say, I think, is that you want to
add string-codeunit-ref. Since there are three standard
forms of code units, you would need three procedures,
not just one:

    string-codeunit-utf-8-ref
    string-codeunit-utf-16-ref
    string-codeunit-utf-32-ref

Making all three of those run in O(1) time is much harder
than making string-ref run in O(1) time.
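
To illustrate the difficulty, here is a minimal sketch of the UTF-16
variant written on top of string-ref, assuming string-ref indexes
scalar values. The decision to return #f for an out-of-range index is
mine, made only to keep the sketch short. Without an auxiliary index
it has to walk the string from the start, so it runs in O(n) time:

```scheme
;; Sketch only: returns the k-th UTF-16 code unit of str as an integer,
;; or #f if k is out of range.  quotient and remainder are the R5RS
;; names; R6RS also offers div and mod.
(define (string-codeunit-utf-16-ref str k)
  (let loop ((i 0) (units 0))            ; units = code units before char i
    (if (>= i (string-length str))
        #f
        (let ((sv (char->integer (string-ref str i))))
          (cond ((< sv #x10000)          ; one code unit
                 (if (= units k)
                     sv
                     (loop (+ i 1) (+ units 1))))
                ((= units k)             ; high (lead) surrogate
                 (+ #xD800 (quotient (- sv #x10000) #x400)))
                ((= (+ units 1) k)       ; low (trail) surrogate
                 (+ #xDC00 (remainder (- sv #x10000) #x400)))
                (else (loop (+ i 1) (+ units 2))))))))
```

For the three-character string consisting of a, U+1D11E, and b, code
units 0 through 3 are #x61, #xD834, #xDD1E, and #x62.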

> Furthermore, the draft should remove the datatype of fixed-length
> mutable string, since it is
> (a) useless,
> (b) difficult to implement unless you also make it variable-length.

It isn't useless, and making it variable-length makes it
harder to implement, not easier, but I agree that having
variable-length strings would be useful and would add
little additional complexity to an implementation of
strings as specified in the current draft R6RS.

Will
Received on Tue Mar 20 2007 - 13:33:27 UTC