[r6rs-discuss] Strings as codepoint-vectors: bad

From: Shiro Kawai <shiro>
Date: Sun Mar 18 17:26:25 2007

From: "Jason Orendorff" <jason.orendorff_at_gmail.com>
Subject: Re: [r6rs-discuss] Strings as codepoint-vectors: bad
Date: Sun, 18 Mar 2007 16:25:05 -0400

> (a) UTF-8 and UTF-16 were designed to facilitate writing efficient
> algorithms. Hiding them hides this facility. R5.92RS leaves the
> programmer with neither (string-find) nor a decent way to implement
> it.
>
> (b) Any implementation that chooses to represent strings in UTF-8 or
> UTF-16 will have unacceptably bad performance running simple portable
> code that uses (string-ref), because (string-ref) will be O(N).
>
> (c) If you know Unicode, it's not hard to work with code units. UTF-8
> and UTF-16 were explicitly designed with this in mind. If you don't
> know Unicode, you're unlikely to write correct code on top of the
> R5.92RS libraries anyway. Hiding code units eliminates exactly one
> pitfall--among *many*.

If these are the issues, can't they be solved by providing
(1) a set of lower-level APIs to peek into the underlying string
representation, and (2) a set of higher-level APIs that do all the
clever implementation-dependent optimization under the hood?

For example, Gauche provides a way to mirror a string as a
byte vector (an srfi-4 vector) to expose its internal multibyte
representation, and also a bunch of string search & extraction
APIs that internally take advantage of the underlying representation
(e.g. the regexp matcher works on a byte sequence rather than a
character sequence internally) and avoid indexed string operations
(e.g. by directly returning matched substrings).
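
As an illustration of the two styles, here is a minimal sketch in
Python rather than Scheme. It is not Gauche's actual code, and the
names utf8_string_ref and utf8_search are made up for this example.
The first function shows why an indexed string-ref over UTF-8 has to
scan from the start; the second searches on the bytes directly and
returns the matched pieces instead of an index. Searching on bytes is
safe when both inputs are valid UTF-8 and the needle is non-empty,
because a lead byte can never be mistaken for a continuation byte, so
a match always starts and ends on a character boundary.

```python
def utf8_string_ref(buf: bytes, k: int) -> str:
    """Return the k-th character of UTF-8 data by scanning: O(N)."""
    if k < 0:
        raise IndexError(k)
    seen = -1      # index of the last character whose lead byte we passed
    start = None   # byte offset where character k begins
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:          # not a continuation byte
            seen += 1
            if seen == k:
                start = i
            elif seen == k + 1:         # reached the next character
                return buf[start:i].decode("utf-8")
    if start is None:
        raise IndexError(k)
    return buf[start:].decode("utf-8")  # character k was the last one


def utf8_search(haystack: bytes, needle: bytes):
    """Search on bytes; return (before, match, after) or None."""
    pos = haystack.find(needle)
    if pos < 0:
        return None
    end = pos + len(needle)
    return haystack[:pos], haystack[pos:end], haystack[end:]
```

A caller that only needs the text around a match never has to
compute a character index at all, which is the point of the
higher-level API.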

A difficulty is that, although this provides a way to write
optimized string operations in Scheme, it still doesn't allow
one to write portable, universally efficient code in Scheme.
Code that works very well on a utf-8 representation would perform
not so well if the implementation's native string is utf-16;
even if the portable code switches among utf-8/16/32 specialized
code, it may still work badly if the implementation uses "smarter"
strings like ropes. A higher-level API is a bit better, but
there are still differences. For example, in Gauche, returning
the matched substring from a string matcher makes sense, since
taking a substring is pretty fast. (The body of a string is immutable
and shared. OTOH, string-set! is disastrously slow.) Other
implementations may make a different choice, keeping string-set!
fast but incurring copying in substring. Given such variations,
I think it is very hard to write portable, universally efficient
code no matter how you expose the string representation.
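
The substring vs. string-set! trade-off can be sketched as follows,
again in Python and only as a rough model, not Gauche's real data
structure. The class SharedString and its method names are
hypothetical. A string is a view (start, end) onto an immutable
UTF-8 body, so taking a substring only creates a new view, while
replacing one character has to build a whole new body, since the
new character may need a different number of bytes.

```python
class SharedString:
    """A (start, end) view onto an immutable, shared UTF-8 body."""

    def __init__(self, body: bytes, start: int = 0, end=None):
        self.body = body
        self.start = start
        self.end = len(body) if end is None else end

    def __str__(self):
        return self.body[self.start:self.end].decode("utf-8")

    def subview(self, bstart: int, bend: int) -> "SharedString":
        """Substring by byte offsets within this view: O(1), no copy."""
        if not (0 <= bstart <= bend <= self.end - self.start):
            raise IndexError((bstart, bend))
        return SharedString(self.body, self.start + bstart,
                            self.start + bend)

    def with_char_replaced(self, boff: int, ch: str) -> "SharedString":
        """Model of string-set!: copies everything, O(N)."""
        view = self.body[self.start:self.end]
        lead = view[boff]
        if (lead & 0xC0) == 0x80:
            raise ValueError("offset is not at a character boundary")
        # Width of the old character, read from its lead byte.
        if lead < 0x80:
            old_len = 1
        elif lead >> 5 == 0b110:
            old_len = 2
        elif lead >> 4 == 0b1110:
            old_len = 3
        else:
            old_len = 4
        new_body = view[:boff] + ch.encode("utf-8") + view[boff + old_len:]
        return SharedString(new_body)
```

An implementation built around a mutable fixed-width array would
show the opposite profile: cheap in-place updates, but a copy for
every substring.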

In practice, when I want to use portable Scheme libraries
in Gauche, I just go through the code and redefine some string
manipulation procedures, and that's enough.

This issue was raised before, and several suggestions were made,
like using opaque pointers to manipulate strings and making
strings immutable, but I think the general consensus
was to put off this issue to R7RS or later, for there are
enough other issues in R6RS. So, for R6RS, I'm content
as long as R6RS doesn't restrict the choice of string implementation
too much.

--shiro
Received on Sun Mar 18 2007 - 17:26:14 UTC
