[r6rs-discuss] Strings from Jason Orendorff on 2007-03-21 (r6rs-discuss.mbox)

From: Jason Orendorff <jason.orendorff>
Date: Wed Mar 21 15:30:50 2007

I'm sorry I can't respond to every comment here. A few general things.

Some comments have been dismissive of UTF-8 and UTF-16. Some have
been dismissive of contiguous-buffer strings. This is surprising to
me. As far as I know, *all* widely used, general-purpose string
implementations are contiguous-buffer.

And most (but not all) Unicode string implementations use UTF-16.
Among languages and libraries that are very widely used, the majority
is overwhelming: Java, Microsoft's CLR, Python, JavaScript, Qt,
Xerces-C, and on and on. (The few counterexamples use UTF-8: glib,
expat. And expat can be compiled to use UTF-16.)

This is just an argument from popularity, but I doubt the editors
intended to discourage this simple, proven internal representation.
Doing so would be an unpleasant surprise for implementors with enough
brains to value simplicity and interoperability. ;)

---
Shiro Kawai wrote:
> I think string-find is a bad example, because "simple" and "efficient"
> are opposed.  Efficient means Boyer-Moore or some variant of it, and
> that's not simple.

Well--yes.  Regardless, I think the example is appropriate.  R5.92RS
doesn't support writing *any* such algorithm efficiently and portably
(forget simply).

And:

> I assume the primary benefit of O(1) string-ref is that it is
> probably the simplest and the most portable way to point a
> position in a string.  "Portable" here is that I can safely
> save it to file and read it by other implementation, or
> send it over the network.  But for internal use, like
> implementing search operation, or passing its results to
> substring operation, it is an illusion that O(1) string-ref
> is enough to implement efficient algorithms.  The efficient
> one differs greatly among implementations (e.g. using Boyer-Moore
> directly on utf-8 octet sequence), so it's better to have
> higher-level APIs.

Higher-level APIs are a fine approach.

The other solution is to standardize the implementation, so that the
efficient algorithms don't differ.  I want to push this seriously one
last time:  Unicode strings have been kicked around for a while now,
and despite Will's link, real-world implementations do not vary much.
I don't think it's premature to standardize.
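
For concreteness, the kind of byte-level search Shiro mentions can be sketched as follows. This is Python rather than Scheme, and it uses Boyer-Moore-Horspool, a simplified variant of Boyer-Moore, run directly on UTF-8 octets. The name `horspool_find` is made up for illustration and is not part of any draft or library. Note that the result is a byte offset, not a character index, which is exactly the kind of implementation-specific position being discussed.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of the first match of needle, or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # Shift table: for each byte in the needle (except the last one),
    # the distance from its last occurrence to the end of the needle.
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    pos = 0
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            return pos
        # Skip ahead based on the haystack byte aligned with the needle's end.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1


text = "naïve café".encode("utf-8")
pattern = "café".encode("utf-8")
byte_offset = horspool_find(text, pattern)
# Converting the byte offset to a character index means decoding the prefix.
char_index = len(text[:byte_offset].decode("utf-8"))
```

Because UTF-8 is self-synchronizing, a match of a valid UTF-8 pattern in valid UTF-8 text always starts on a character boundary, so the search itself never needs to decode anything. Here the byte offset and the character index differ because "ï" occupies two octets.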

And:

> I agree that r6rs shouldn't be affected just because it can't be
> implemented easily by some specific implementing languages
> (otherwise we wouldn't have call/cc).

First of all, the words "just because" don't belong here.  The Java
thing is an afterthought.

But also--this is the second time someone has compared strings to
core features of Scheme, like call/cc.  I agree call/cc is too
valuable to give up.  But I don't see what we're talking about here
that's so valuable.  Avoiding a specific bug involving surrogate
pairs?  The freedom for implementors to choose whatever implementation
they want (except, apparently, the one proven model that everyone else
uses)?
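
To make the surrogate-pair bug concrete, here is a small Python sketch that stands in for a UTF-16 string implementation. The helper names (`utf16_units`, `safe_split_index`) are hypothetical and chosen only for this illustration. It shows a character outside the Basic Multilingual Plane taking two code units, and how an index into the code units can land between the two halves.

```python
import struct


def utf16_units(s: str) -> list:
    """Return the UTF-16 code units of s as a list of integers."""
    raw = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def safe_split_index(units: list, i: int) -> int:
    """Move i back by one if it would fall inside a surrogate pair."""
    if 0 < i < len(units) and is_low_surrogate(units[i]) \
            and is_high_surrogate(units[i - 1]):
        return i - 1
    return i


s = "a\U0001D11Eb"  # 'a', MUSICAL SYMBOL G CLEF (U+1D11E), 'b'
units = utf16_units(s)
# A split at code-unit index 2 would cut the G clef in half;
# safe_split_index backs up to index 1, the start of the pair.
split_at = safe_split_index(units, 2)
```

The string has three characters but four code units, so code that treats a code-unit index as a character index is off by one after the G clef, and a split at index 2 leaves an unpaired surrogate on each side.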

There are interesting areas where Scheme *should* be different from
other languages... and there are areas I wish you guys would find
uninteresting :) and just borrow an established design from somewhere.
In this case, there's only one established design to choose from...

-j

Received on Wed Mar 21 2007 - 15:30:34 UTC
