[r6rs-discuss] Strings as codepoint-vectors: bad
Jason Orendorff wrote:
> On 3/16/07, Thomas Lord <lord_at_emf.net> wrote:
>> The generic error is wishing for some "easy way out" that
>> makes Unicode as easy to hack as ASCII. Won't happen.
>> Text is just not that simple. Unicode does a fantastic job of
>> making it "... but no simpler".
>
> Obviously I've done a very poor job expressing myself.
Or me at reading you, or both. But let's dig in a bit....
>
> There is, as you mentioned elsewhere, a tower here:
> - text
> - grapheme clusters
> - Unicode scalar values
> - code units
>
This can't easily be stressed and thought through enough, I think
we agree.
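To make the lower rungs of that tower concrete, here is a minimal sketch in Python (chosen only for illustration; the sample string and variable names are assumptions, not anything from the draft). The cluster count is a rough approximation based on combining classes, not the full Unicode segmentation rules.

```python
import unicodedata

# Sample text (an assumption for illustration): 'e' + COMBINING ACUTE ACCENT,
# followed by MUSICAL SYMBOL G CLEF (U+1D11E), which lies outside the BMP.
text = "e\u0301\U0001D11E"

# Unicode scalar values: a Python str is indexed by code point.
scalars = [ord(ch) for ch in text]

# Code units: 8-bit units for UTF-8, 16-bit units for UTF-16.
utf8_units = list(text.encode("utf-8"))
utf16_bytes = text.encode("utf-16-be")
utf16_units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]

# Rough grapheme-cluster count: scalars that are not combining marks.
# This is only an approximation of real cluster segmentation.
clusters = sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(len(utf8_units), len(utf16_units), len(scalars), clusters)
```

The same text has a different length at each level, which is why "the length of a string" has no single answer until you pick a rung.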
> R6RS presents strings as sequences of Unicode scalar values, as though
> (a) nothing much useful can be done with the code units; (b) if the
> code units are hidden, implementors can reasonably choose whatever
> representation they want, and (c) just hiding code units is very
> helpful to programmers. All three statements are false.
(a) is false, I agree. I think the tentative rationale of r5.92 is that
you can use byte/fixnum manipulation stuff for code units.
(b) and (c) are trickier, though. By hiding code units in the
(historically central, reflectively important) CHAR and STRING
types you do afford a choice of implementation and that can be
helpful even though performance among implementations is likely
to vary a lot more.
There are "holes" in the draft, which as Shiro goes on to suggest,
might be better left for R7 -- reflecting the complete tower of
Unicode representations in a practical way is a good example.
At the same time, for simple uses -- and especially for portable
mechanisms by which scheme code can reflectively manipulate
portable scheme source -- a little agnosticism can go a long way.
I find 5.92 flawed for insufficient agnosticism in the CHAR
domain and, consequently, in constraints on STRING manipulation.
More agnosticism will leave open a door to a better R7, for example. Plus,
that "leaving the door open" or "agnosticism" points to what I regard
as one of the central differentiating characteristics of the report:
that it aspires (never quite perfects, obviously) towards a kind of
essentialism that transcends more conventionally operational approaches
to defining a language.
I'd rather see a period of experimentalism that follows R6 than
some suspect claim that R6 has correctly mandated Unicode support
in a way that locks down the CHAR and STRING types.
>
> (a) UTF-8 and UTF-16 were designed to facilitate writing efficient
> algorithms. Hiding them hides this facility. R5.92RS leaves the
> programmer with neither (string-find) nor a decent way to implement
> it.
Historically, UTF-8 was designed to facilitate rapid development
with efficiency taking a back seat to that. You can google around
and find a funny account of the dinner break at a diner where it
was worked out on the back of a placemat (or something like that).
UTF-16 is fantastic for where space matters more than time and
for culturally biased programming. I'm not sure in which of the two
ambiguous parses I mean "and" there..... :-) I would venture that
adaptive-to-content representations are probably the long term
solution, both on the wire and in apps, at least for culturally
neutral apps.
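The claim in (a) about algorithms over code units can be sketched as follows, assuming well-formed UTF-8 input. The function name `find_scalar_index` is hypothetical and Python stands in for Scheme. Because a UTF-8 lead byte never looks like a continuation byte, a plain byte search for a well-formed, non-empty needle can only match at a scalar-value boundary.

```python
def find_scalar_index(haystack: str, needle: str) -> int:
    """Search at the code-unit level, report the result as a scalar index.

    Hypothetical helper for illustration; returns -1 if needle is absent.
    """
    h = haystack.encode("utf-8")
    n = needle.encode("utf-8")

    # Ordinary byte-string search over UTF-8 code units.
    pos = h.find(n)
    if pos < 0:
        return -1

    # Convert the byte offset to a scalar-value index by counting the
    # bytes before it that are not continuation bytes (10xxxxxx).
    return sum(1 for b in h[:pos] if (b & 0xC0) != 0x80)


print(find_scalar_index("na\u00efve caf\u00e9", "caf\u00e9"))
```

The search itself never decodes anything; only the final offset conversion looks at byte structure, and a caller content with byte offsets could skip that step.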
>
> (b) Any implementation that chooses to represent strings in UTF-8 or
> UTF-16 will have unacceptably bad performance running simple portable
> code that uses (string-ref), because (string-ref) will be O(N).
>
Oh, you mean like the tons and tons of C code that runs all over the
place? Ok, that's ever so slightly different since the C code easily
sees more of the tower but....
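The cost that (b) refers to can be sketched like this, again in Python and assuming well-formed UTF-8. `utf8_string_ref` is a hypothetical name standing in for what a (string-ref) would have to do over a UTF-8 representation with no auxiliary index: walk from the start, one encoded scalar value at a time.

```python
def utf8_string_ref(buf: bytes, k: int) -> str:
    """Return the k-th scalar value of well-formed UTF-8 `buf`.

    Hypothetical helper for illustration. The scan is linear in the
    number of scalar values before position k.
    """
    if k < 0:
        raise IndexError("index out of range")
    i = 0
    while i < len(buf):
        lead = buf[i]
        # The lead byte determines the width of the encoded scalar value.
        if lead < 0x80:
            width = 1
        elif lead < 0xE0:
            width = 2
        elif lead < 0xF0:
            width = 3
        else:
            width = 4
        if k == 0:
            return buf[i:i + width].decode("utf-8")
        k -= 1
        i += width
    raise IndexError("index out of range")


buf = "a\u00e9\u20ac\U0001D11E".encode("utf-8")
print(utf8_string_ref(buf, 2))
```

A loop that calls this for every index of a string does quadratic work overall, which is the portability worry raised in (b).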
> (c) If you know Unicode, it's not hard to work with code units. UTF-8
> and UTF-16 were explicitly designed with this in mind. If you don't
> know Unicode, you're unlikely to write correct code on top of the
> R5.92RS libraries anyway. Hiding code units eliminates exactly one
> pitfall--among *many*.
Just pause there and separate concerns. On the one hand, hiding code
units *might* be a bad idea (though I have no trouble imagining
applications, such as Emacs, where it is a very fine idea). On the
other hand, even though there are plenty of apps where you want code
units exposed, it is one thing to not standardize how that's done and
another thing entirely to make it utterly impractical even in
non-standard-yet-conforming ways. I join you in resisting the latter,
but not the former.
>
> There's no "easy way out" aspect to it. The string abstraction in
> R5.92RS simply doesn't make sense to me as an abstraction.
>
I agree for overlapping but not identical reasons.
Basically, yes, you're right -- in and of itself the 5.92 spec for
CHAR/STRING .... and equally my proposed corrections .... -- neither
really propels portable Scheme into a complete programming environment
for Unicode hacking. A sufficiently weak CHAR/STRING would enable a
lot of basic Unicode hacking without precluding expansion, on an
uncertain path, towards a more complete standard.
-t
> -j
>
Received on Sun Mar 18 2007 - 21:44:32 UTC