I think we've gotten way off course. The only reason to standardize
the internal representation of strings would be to expose code units.
Otherwise you wouldn't bother. I can think of two good reasons to
expose code units and one pragmatic reason:

1. Performance. I think R6RS should support a portable regex
   library--one that people can actually use. A portable parser
   library would also be nice. These things need fast access to
   code units.

2. Native call interface. A portable one is beyond the scope of
   R6RS, but a standard representation for strings now would
   simplify future efforts.

3. (the pragmatic reason) Maybe the editors don't have time to add a
   thorough high-level string API to R6RS. I don't know if this is
   true or not. If so, a simple, conventional low-level API would
   be an improvement over the current draft.

If these reasons are unpersuasive, we need not carry on about UTF-8
vs. UTF-16 etc. etc. If the editors decide that R6RS will not expose
code units, I'll just second Per Bothner's suggestion:
> * More generally, write the specification with the assumption
> that many/most Scheme implementations will use a simple
> UTF-8 array or a UTF-16 array. In the case of mutable
> strings, the array may be grown/relocated, and optionally
> use a buffer-gap scheme. We should not assume or require
> anything more complicated.
On the other hand, if it seems desirable to expose code units, UTF-16
is a good balance of all the factors.
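
To make "code units" concrete, here is a rough sketch in Python (not
Scheme, and not a proposed R6RS API). The helper name `code_units` and
the sample string are made up for illustration. It only shows how the
same three characters break down into 8-bit units under UTF-8 and
16-bit units under UTF-16.

```python
def code_units(text: str, encoding: str) -> list[int]:
    """Return the code units of `text` under 'utf-8' or 'utf-16'.

    Hypothetical helper for illustration only.
    """
    if encoding == "utf-8":
        # UTF-8 code units are single bytes.
        return list(text.encode("utf-8"))
    if encoding == "utf-16":
        # UTF-16 code units are 16-bit values; encode without a BOM
        # and read the bytes back two at a time.
        raw = text.encode("utf-16-le")
        return [int.from_bytes(raw[i:i + 2], "little")
                for i in range(0, len(raw), 2)]
    raise ValueError("unsupported encoding: " + encoding)


# 'a', Greek small lambda, and U+1D11E (a character outside the BMP).
sample = "a\u03bb\U0001D11E"

utf8_units = code_units(sample, "utf-8")
utf16_units = code_units(sample, "utf-16")

print(len(sample), len(utf8_units), len(utf16_units))
print([hex(u) for u in utf16_units])
```

The three characters take seven UTF-8 code units but four UTF-16 code
units, and only the character outside the BMP needs a surrogate pair.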
-j
Received on Tue Mar 27 2007 - 13:58:53 UTC