[r6rs-discuss] unicode (re comment #134) from Thomas Lord on 2006-12-17 (r6rs-discuss.mbox)

From: Thomas Lord <lord>
Date: Sun Dec 17 18:38:32 2006

Marcin 'Qrczak' Kowalczyk wrote:
> IMHO the two best choices for the Unicode-based notion of a character
> in a programming language are:
>
> * code points: 0..#x10FFFF
> * Unicode scalar values: 0..#x10FFFF excluding #xD800..#xDFFF
>
> They are simple to understand in the context of Unicode, atomic,
> easy to store in strings, and easy to exchange with other languages
> with Unicode-based strings.
>

I agree, more or less. I would add "a superset of code points"
to the mix.

The issue raised in comment #134 is that, as written, the draft
*requires* that char be a "Unicode scalar value". An implementation
is not permitted to make char the same as code points, or a superset
of code points (at least if integer->char is to behave in the
expected way).
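
To make the distinction concrete, here is a small sketch in Python
(not Scheme, and not taken from the draft). The helper names are
hypothetical. It assumes a code point is any integer in 0..#x10FFFF
and a scalar value is a code point outside the surrogate range
#xD800..#xDFFF, and it shows how an integer->char-like procedure
might behave under each reading.

```python
# Illustrative only: these names are assumptions, not R6RS procedures.
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_code_point(n):
    """True for any integer in the Unicode code space."""
    return 0 <= n <= MAX_CODE_POINT


def is_scalar_value(n):
    """True for code points that are not surrogates."""
    return is_code_point(n) and not (SURROGATE_FIRST <= n <= SURROGATE_LAST)


def integer_to_char_scalar(n):
    """integer->char if char means 'Unicode scalar value': surrogates are rejected."""
    if not is_scalar_value(n):
        raise ValueError(f"#x{n:X} is not a Unicode scalar value")
    return chr(n)


def integer_to_char_code_point(n):
    """integer->char if char means 'code point': unpaired surrogates are allowed."""
    if not is_code_point(n):
        raise ValueError(f"#x{n:X} is not a Unicode code point")
    return chr(n)
```

Under the first reading, #xD800 is simply outside the domain of
integer->char. Under the second it is an ordinary char value. A
"superset of code points" model would relax the upper bound as well.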

-t

> Thomas Lord <lord_at_emf.net> writes:
>
>
>> *Allowing* unpaired surrogates does not *require* that
>> unpaired surrogates be supported.
>>
>
> Permitting variation hinders portability, it leads to programs which
> run correctly only on a subset of implementations.
>
>
>> Tom> For every natural number (integers greater than or equal to 0)
>> Tom> there exists a distinct CHAR value. The set of all such values
>> Tom> is called "simple characters".
>>
>> John> Whatever for?
>>
>> So that the abstract model of character values is mathematically
>> simple and so that it is a good model for communications generally.
>>
>
> It's indeed simpler, but it's worse for communication, not better,
> because all the rest of the world uses Unicode within its limits.
>
>
>> It keeps the communications model intact. An N-bit wide port,
>> in this model, conveys characters 0..2^N-1.
>>
>
> The range of Unicode code points is not a power of 2, so this can't
> express the code point port nor Unicode scalar value port.
>
> The current computing world doesn't use N-bit wide ports for arbitrary
> values of N. Almost all interchange formats are based on sequences of
> bytes, and there is no obvious mapping between an N-bit port and a
> byte port (there are several choices), so different pieces of software
> using N-bit ports can't necessarily communicate directly with each
> other, even for an agreed N.
>
>
>> You shorted yourself, then, by not getting to the topic of combining
>> sequence characters. Remember that, in addition to a simple
>> character for every integer, I'm also suggesting that all tuples
>> of simple characters are themselves characters.
>>
>
> This requires an API to look inside characters, char->integer is no
> longer sufficient.
>
> This complicates exchange with the rest of the world, because almost
> all Unicode-based strings in various languages consist of some atomic
> units: either Unicode scalar values, or code points, or code units of
> some encoding form (UTF-8/16/32).
>
> I would design this differently: don't use a separate character type
> at all, identify them with strings of length 1. But it would break
> Scheme tradition (which goes back to Lisp tradition), so I'm not
> proposing this for Scheme.
>
>
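
As a rough illustration of the quoted point that an N-bit port has no
single obvious mapping onto a byte port, the Python sketch below
serializes the same sequence of 21-bit values in three plausible ways.
The function names, the 3-byte container width, and the three layouts
are assumptions chosen for the example, not anything proposed in this
thread.

```python
# Illustrative only: three of several possible N-bit-to-byte mappings.

def pack_big_endian(values, width_bytes=3):
    """Each value in its own fixed-width container, most significant byte first."""
    return b"".join(v.to_bytes(width_bytes, "big") for v in values)


def pack_little_endian(values, width_bytes=3):
    """Each value in its own fixed-width container, least significant byte first."""
    return b"".join(v.to_bytes(width_bytes, "little") for v in values)


def pack_bitstream(values, n_bits=21):
    """Values concatenated MSB-first with no padding between them;
    the final byte is zero-padded on the right."""
    acc = 0
    for v in values:
        if not 0 <= v < (1 << n_bits):
            raise ValueError(f"{v} does not fit in {n_bits} bits")
        acc = (acc << n_bits) | v
    total_bits = n_bits * len(values)
    pad = (-total_bits) % 8  # bits needed to reach a byte boundary
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


values = [0x41, 0x10FFFF]
encodings = {
    "big-endian": pack_big_endian(values),
    "little-endian": pack_little_endian(values),
    "bitstream": pack_bitstream(values),
}
```

All three results are six bytes long for this input, yet no two are
equal, so two programs that agree on N = 21 still need to agree on the
byte layout before they can exchange data.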

Received on Sun Dec 17 2006 - 18:41:21 UTC
