[r6rs-discuss] Strings as codepoint-vectors: bad from Jason Orendorff on 2007-03-18 (r6rs-discuss.mbox)

From: Jason Orendorff <jason.orendorff>
Date: Sun Mar 18 22:56:39 2007

On 3/18/07, Shiro Kawai <shiro_at_lava.net> wrote:
> From: "Jason Orendorff" <jason.orendorff_at_gmail.com>
> Subject: Re: [r6rs-discuss] Strings as codepoint-vectors: bad
> Date: Sun, 18 Mar 2007 16:25:05 -0400
>
> > (a) UTF-8 and UTF-16 were designed to facilitate writing efficient
> > algorithms. Hiding them hides this facility. R5.92RS leaves the
> > programmer with neither (string-find) nor a decent way to implement
> > it.
> >
> > (b) Any implementation that chooses to represent strings in UTF-8 or
> > UTF-16 will have unacceptably bad performance running simple portable
> > code that uses (string-ref), because (string-ref) will be O(N).
> >
> > (c) If you know Unicode, it's not hard to work with code units. UTF-8
> > and UTF-16 were explicitly designed with this in mind. If you don't
> > know Unicode, you're unlikely to write correct code on top of the
> > R5.92RS libraries anyway. Hiding code units eliminates exactly one
> > pitfall--among *many*.
>
> If these are the issues, can't they be solved if there are
> (1) a set of lower-level API to peek into the underlying string
> representation, and (2) a set of higher-level API that does all
> clever implementation-dependent optimization under the hood?

Er, maybe. I'm not sure what you mean by (1). Do you mean something
simple and portable? If so, yes, that's what I'm asking for!

(2) is exactly what's required to meet the goal of hiding the
in-memory representation from users. But defining such an API is a
huge task (arguably no one has yet achieved this for any language),
clearly out of scope for R6RS. Therefore I think R6RS should focus on
providing a good low-level API.

I won't continue to harp on this, but it bears repeating one more
time: The current draft API doesn't support a simple, portable,
efficient (string-find). I find this pretty dismal.
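
To illustrate the point about code units, here is a minimal sketch in
Python (not Scheme, and not anything in the R5.92RS draft) of a substring
search that works directly on UTF-8 bytes. The name `find_code_units` is
hypothetical. It assumes both arguments are valid UTF-8; under that
assumption a non-empty match can only begin on a character boundary,
because the needle's first byte is never a continuation byte.

```python
def find_code_units(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of the first match of needle, or -1.

    Hypothetical helper for illustration. Both arguments are assumed to be
    valid UTF-8. The search compares raw code units (bytes) and never
    decodes a code point.
    """
    n, m = len(haystack), len(needle)
    # Naive scan: try every starting offset where the needle could fit.
    for i in range(n - m + 1):
        if haystack[i:i + m] == needle:
            return i
    return -1


text = "naïve café".encode("utf-8")
offset = find_code_units(text, "café".encode("utf-8"))
```

Here `offset` is 7, a byte offset: the "ï" occupies two bytes, so the
code point index of the match would be 6. The search itself never needs
that index.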

Your comments and others here suggest that implementors and users want
their implementations to do wildly varying things underneath the API
provided by R5.92RS. I would be surprised if this were really true.
One does not write portable text-manipulating code and then switch to a
different implementation just to get rope-like performance instead of
conventional string-like performance. Really, I think each
implementor* wants conventional strings in either UTF-8 or UTF-16.**
I'd be happy to be corrected on this. Unfortunately these are exactly
the representations that are most severely punished by an API that
focuses on Unicode scalar values.
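
As a rough sketch of why a scalar-value API punishes these
representations, here is what indexing by code point looks like over a
UTF-8 buffer. This is Python for illustration, with a hypothetical name
(`utf8_string_ref`), and it assumes a plain UTF-8 buffer with no side
index; a real implementation could add caches or other tricks. Without
them, finding the k-th code point means scanning from the start, so the
cost grows with k.

```python
def utf8_string_ref(buf: bytes, k: int) -> str:
    """Return the k-th code point (0-based) of buf by linear scan.

    Hypothetical helper for illustration; buf is assumed to be valid UTF-8.
    """
    if k < 0:
        raise IndexError(k)
    seen = -1
    for i, b in enumerate(buf):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte starts a new code point.
        if (b & 0xC0) != 0x80:
            seen += 1
            if seen == k:
                # Extend over the continuation bytes of this code point.
                j = i + 1
                while j < len(buf) and (buf[j] & 0xC0) == 0x80:
                    j += 1
                return buf[i:j].decode("utf-8")
    raise IndexError(k)


buf = "naïve café".encode("utf-8")
third = utf8_string_ref(buf, 2)
```

A loop that calls this once per index to walk a whole string does
quadratic work on such a buffer, which is the cost that point (b) in the
quoted message refers to.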

-j

[*] Assuming, arguendo, that the goal is representing strings of
Unicode text.

[**] With implementation-specific speed and memory management hacks,
always. But basically, UTF-8 or UTF-16. Some might initially like
the proposal where there are several representations with varying
character widths, but I suspect that road leads to a lot of suffering.
Educated guess.
Received on Sun Mar 18 2007 - 22:56:34 UTC
