[r6rs-discuss] perhaps i should be formal, but....
Per Bothner wrote:
> Thomas Lord wrote:
>
>> My question is whether any principled reason for these arbitrary
>> constants is given that might be supported without appeal
>> to analogies to other programming languages.
>
> Consider what happens if Unicode surrogate values are considered
> valid characters. That implies they can be stored in a string,
> which is basically a character array.
If strings are simply unrestricted arrays of characters (which
seems like a good idea to me), then it is possible to create
ill-formed Unicode texts whether or not surrogates can be
represented as characters. But to your main point:
>
> Then the question arises as to what it means to index into a
> string: Is it the N'th code point or the N'th scalar value?
> The draft specifies that it's the N'th scalar value - which
> means any use of surrogates must be hidden.
>
There is no question (or such a question is easily resolved).
You are conflating the *internal* use of surrogates by
an implementation with their potential appearance at the
semantic level. In other words, if an implementation
chooses to support surrogates as CHAR values, then that
implementation is obligated to distinguish between any
internal use it makes of surrogates and storing an isolated
surrogate in a string.
For example, an implementation should be *permitted*
to allow:

    (define s
      (list->string
        (list (integer->char #xd800) (integer->char #xdc00))))

but, if it does, then it is mandatory that:

    (string-length s) => 2

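
As an illustration only: Python 3 strings happen to be sequences of code points in which isolated surrogates may be stored, so Python can stand in for the hypothetical permissive Scheme implementation described above. This is a sketch, not anything the draft specifies, and the variable names are invented for the example.

```python
# Two isolated surrogates, stored as two separate characters.
pair = chr(0xD800) + chr(0xDC00)

# One supplementary character whose UTF-16 encoding happens to use
# the same two code units, D800 DC00.
astral = chr(0x10000)

print(len(pair))       # 2
print(len(astral))     # 1
print(pair == astral)  # False

# The UTF-16 form exists only at the encoding level; it does not
# change how many characters the string holds.
utf16 = astral.encode("utf-16-be")
print(utf16.hex())     # d800dc00
```

The two strings stay distinct at the semantic level even though the second one, when encoded, uses surrogate code units internally.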
> If you allow Unicode surrogate values as actual character
> values then you effectively prohibit an implementation
> from storing characters internally using UTF-16, since you
> can't tell whether a surrogate pair is one Scheme character
> or two. UTF-16 is the natural representation in Java, at least.
> (I think that might be the code in Windows APIs as well.)
On the contrary. When you *allow* surrogate values as
actual character values you don't mandate them. Therefore,
an implementation that uses UTF-16 as you describe is
*also allowed*.
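
A minimal sketch of that second kind of implementation, written in Python for concreteness: a hypothetical `Utf16String` type (the class and helper names are assumptions made for this example) that stores UTF-16 code units internally and simply declines to accept surrogates as characters, so a surrogate pair in its storage is never ambiguous.

```python
def is_surrogate(n):
    """True if n lies in the surrogate range D800..DFFF."""
    return 0xD800 <= n <= 0xDFFF


class Utf16String:
    """Hypothetical string type backed by a list of UTF-16 code units."""

    def __init__(self, scalars):
        self.units = []
        for n in scalars:
            # This implementation chooses not to admit surrogates as chars.
            if is_surrogate(n) or not 0 <= n <= 0x10FFFF:
                raise ValueError("not a Unicode scalar value: %#x" % n)
            if n < 0x10000:
                self.units.append(n)
            else:
                # Encode a supplementary character as a surrogate pair.
                n -= 0x10000
                self.units.append(0xD800 + (n >> 10))
                self.units.append(0xDC00 + (n & 0x3FF))

    def __len__(self):
        # Count characters, not code units: every unit except a low
        # surrogate starts a new character.
        return sum(1 for u in self.units if not 0xDC00 <= u <= 0xDFFF)


s = Utf16String([0x41, 0x10000])
print(s.units)  # [65, 55296, 56320]
print(len(s))   # 2
```

Because every pair in `units` was produced by the encoder, the length is counted in characters, and the internal surrogates never show up at the semantic level.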
The case of surrogates is handy to illustrate what I think is
a big design mistake (actually several) in the current draft,
but allowing surrogates per se is not the main point.
I'm seeing if I can crank out a formal comment that makes
the larger case before tomorrow's deadline (we'll see).
-t
Received on Wed Mar 14 2007 - 16:30:27 UTC