[r6rs-discuss] Re: [Formal] formal comment (ports, characters, strings, Unicode) from MichaelL_at_frogware.com on 2007-03-19 (r6rs-discuss.mbox)

From: MichaelL_at_frogware.com <MichaelL>
Date: Mon Mar 19 20:17:49 2007

> >> Code units (whether UTF-8, UTF-16, UTF-32, or whatever) are
> >> bit patterns that are used to encode Unicode scalar values.
> >> As programmers and as language designers, one of our guiding
> >> principles is that bit patterns don't matter except where
> >> they are forced upon us by the external world, typically via
> >> i/o.
> >
> > Or when the abstraction leaks, as string-ref does for UTF-8 and
> > UTF-16. Do you think that being able to write string-find portably &
> > efficiently is important?
>
> I must've missed it somewhere, so let me ask the stupid question.
> What's the problem with the current draft that prohibits implementing
> string-find portably and efficiently? All I can find in the archives
> are the following two statements:

UTF-8 and UTF-16 require one or more code units to represent a given
scalar value. Since the number of code units depends on the scalar value
being encoded, there's no constant-time calculation that maps the i'th
scalar value to the j'th code unit. If you want the i'th scalar value in a
UTF-8 or UTF-16 string you have to scan for it from a known position,
typically the start of the string. And that, of course, is what string-ref
is: a request for the i'th scalar value (returned as a character).
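
To illustrate the scan, here is a rough sketch in Python rather than Scheme,
with a UTF-8 string modeled as a bytes object. The names utf8_seq_len and
utf8_string_ref are made up for this example, and the code assumes the
buffer holds well-formed UTF-8; it is not how any particular implementation
does it.

```python
def utf8_seq_len(lead: int) -> int:
    """Code units in the UTF-8 sequence that begins with this lead byte."""
    if lead < 0x80:
        return 1              # 0xxxxxxx
    if lead >> 5 == 0b110:
        return 2              # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3              # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4              # 11110xxx
    raise ValueError("not a UTF-8 lead byte")


def utf8_string_ref(buf: bytes, i: int) -> str:
    """Return the i'th scalar value of buf, scanning from offset 0.

    Hypothetical stand-in for string-ref on a UTF-8 representation.
    Assumes buf is well-formed UTF-8.
    """
    j = 0                     # code-unit offset of the current scalar value
    for _ in range(i):        # skip the first i scalar values one by one
        if j >= len(buf):
            raise IndexError("string index out of range")
        j += utf8_seq_len(buf[j])
    if j >= len(buf):
        raise IndexError("string index out of range")
    return buf[j:j + utf8_seq_len(buf[j])].decode("utf-8")
```

The loop is the point: the offset j of the i'th scalar value is only known
after the i sequences in front of it have been stepped over.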

A simple string-find would string-ref each character in a string, and
(given only R5.92RS and UTF-8 or UTF-16) each string-ref would start from
scratch, so the search as a whole takes time quadratic in the length of the
string rather than linear.
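
To make the cost of such a string-find concrete, here is a rough Python
sketch (not Scheme) that counts how many lead bytes get examined. A UTF-8
string is modeled as a bytes object; seq_len, find_with_string_ref and
find_single_pass are hypothetical names, and the code assumes well-formed
UTF-8. The second function stands in for what a lower-level or iterator-style
API would allow; it is not taken from any draft.

```python
def seq_len(lead: int) -> int:
    """Code units in the UTF-8 sequence starting at this lead byte
    (assumes well-formed UTF-8)."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def find_with_string_ref(buf: bytes, target: str):
    """Find target by asking for index 0, 1, 2, ... in turn, each request
    rescanning from offset 0. Returns (index or -1, lead bytes examined)."""
    steps = 0
    i = 0
    while True:
        j = 0
        for _ in range(i):            # emulate string-ref: skip i scalars
            j += seq_len(buf[j])
            steps += 1
        if j >= len(buf):             # ran off the end: not found
            return -1, steps
        steps += 1
        if buf[j:j + seq_len(buf[j])].decode("utf-8") == target:
            return i, steps
        i += 1


def find_single_pass(buf: bytes, target: str):
    """Find target in one left-to-right pass that remembers its offset.
    Returns (index or -1, lead bytes examined)."""
    steps = 0
    i = 0                             # scalar-value index
    j = 0                             # code-unit offset
    while j < len(buf):
        n = seq_len(buf[j])
        steps += 1
        if buf[j:j + n].decode("utf-8") == target:
            return i, steps
        j += n
        i += 1
    return -1, steps
```

For a ten-character string whose last character is the one sought, the first
version examines 1 + 2 + ... + 10 = 55 lead bytes and the second examines 10.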

There are at least four schools of thought on all of this. First, I
believe that some people think a sufficiently smart compiler could hide
some/many/most of these issues by, for example, caching information or
switching to another encoding on the fly. Second, I believe that some
people think the problem can be resolved or reduced by adding new
abstractions--e.g., string-for-each. Third, I believe that some people think
there's nothing wrong with a lower-level API--e.g., one that exposes code
units--but that it simply shouldn't get standardized. Fourth, some people think
that Unicode encodings are inherently leaky and that a lower-level API
should be standardized in order to allow for portable and efficient string
algorithms. Of course, these positions aren't all mutually exclusive.
Received on Mon Mar 19 2007 - 20:17:11 UTC
