[r6rs-discuss] Strings as codepoint-vectors: bad

From: Jason Orendorff <jason.orendorff>
Date: Sun Mar 18 16:25:13 2007

On 3/16/07, Thomas Lord <lord_at_emf.net> wrote:
> The generic error is wishing for some "easy way out" that
> makes Unicode as easy to hack as ASCII. Won't happen.
> Text is just not that simple. Unicode does a fantastic job of
> making it "... but no simpler".

Obviously I've done a very poor job expressing myself.

There is, as you mentioned elsewhere, a tower here:
  - text
  - grapheme clusters
  - Unicode scalar values
  - code units
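
For concreteness, here is a minimal Python sketch of the lower levels of that tower for one short piece of text. The sample string and variable names are made up for illustration. Grapheme clusters are mentioned only in a comment, because segmenting them needs the Unicode segmentation rules, which the Python standard library does not provide.

```python
# One piece of text viewed at the lower levels of the tower.
# 'e' + COMBINING ACUTE ACCENT, followed by MUSICAL SYMBOL G CLEF (U+1D11E).
# A reader sees two grapheme clusters here; computing that is not attempted.
text = "e\u0301\U0001D11E"

# Unicode scalar values: one integer per Python str element.
scalar_values = [ord(ch) for ch in text]

# UTF-8 code units: one integer per byte.
utf8_units = list(text.encode("utf-8"))

# UTF-16 code units: pair up the big-endian bytes into 16-bit integers.
utf16_bytes = text.encode("utf-16-be")
utf16_units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]

print(len(scalar_values), len(utf8_units), len(utf16_units))
```

The same text has a different length at each level, so "the length of a string" or "the character at index k" only means something once a level has been chosen.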

R6RS presents strings as sequences of Unicode scalar values, as though
(a) nothing much useful can be done with the code units; (b) if the
code units are hidden, implementors can reasonably choose whatever
representation they want; and (c) just hiding code units is very
helpful to programmers. All three statements are false.

(a) UTF-8 and UTF-16 were designed to facilitate writing efficient
algorithms. Hiding them hides this facility. R5.92RS leaves the
programmer with neither (string-find) nor a decent way to implement
it.
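
As a rough illustration of the kind of algorithm meant here, the following Python sketch searches directly over UTF-8 code units without decoding anything. The name `utf8_find` is hypothetical, and Python `bytes` merely stands in for a string type that exposes its code units. The sketch assumes both arguments are well-formed UTF-8. In that case a match always starts on a character boundary, because lead bytes and continuation bytes occupy disjoint ranges.

```python
def utf8_find(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of the first match of needle, or -1.

    Both arguments are assumed to be well-formed UTF-8. The comparison
    is done purely on code units (bytes); nothing is decoded.
    """
    n, m = len(haystack), len(needle)
    for start in range(n - m + 1):
        if haystack[start:start + m] == needle:
            return start
    return -1


haystack = "na\u00efve caf\u00e9".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")
offset = utf8_find(haystack, needle)

# The offset is in code units, and slicing there yields valid UTF-8.
prefix = haystack[:offset].decode("utf-8")
```

The result is a code-unit offset, not a scalar-value index, which is exactly the kind of value an API that hides code units has no way to return.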

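(b) Any implementation that chooses to represent strings in UTF-8 or
UTF-16 will have unacceptably bad performance running simple portable
code that uses (string-ref), because (string-ref) will be O(N): in a
variable-width encoding, finding the k-th scalar value means scanning
from the start of the string unless some auxiliary index is kept. A
simple indexed loop over the whole string then costs O(N^2).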
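
To make the cost concrete, here is a minimal Python sketch of what an index lookup has to do on a UTF-8 representation with no auxiliary index. The name `utf8_string_ref` is hypothetical and is not how any particular Scheme implementation works. It assumes well-formed UTF-8 input.

```python
def utf8_string_ref(buf: bytes, k: int) -> str:
    """Return the k-th scalar value (0-based) of well-formed UTF-8 buf.

    Scans from the start, counting every byte that is not a
    continuation byte (10xxxxxx), so the cost grows with k.
    """
    if k < 0:
        raise IndexError(k)
    index = -1
    for pos, byte in enumerate(buf):
        if (byte & 0xC0) != 0x80:  # lead byte or ASCII: a new character
            index += 1
            if index == k:
                end = pos + 1
                # Extend over the continuation bytes of this character.
                while end < len(buf) and (buf[end] & 0xC0) == 0x80:
                    end += 1
                return buf[pos:end].decode("utf-8")
    raise IndexError(k)


buf = "a\u00e9\U0001D11Ez".encode("utf-8")
chars = [utf8_string_ref(buf, i) for i in range(4)]
```

Building `chars` this way rescans the buffer for every index, which is the quadratic behavior that portable code written around (string-ref) would run into on such an implementation.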

(c) If you know Unicode, it's not hard to work with code units. UTF-8
and UTF-16 were explicitly designed with this in mind. If you don't
know Unicode, you're unlikely to write correct code on top of the
R5.92RS libraries anyway. Hiding code units eliminates exactly one
pitfall--among *many*.

There's no "easy way out" aspect to it. The string abstraction in
R5.92RS simply doesn't make sense to me as an abstraction.

-j
Received on Sun Mar 18 2007 - 16:25:05 UTC
