IMHO the two best choices for the Unicode-based notion of a character
in a programming language are:
* code points: 0..#x10FFFF
* Unicode scalar values: 0..#x10FFFF excluding #xD800..#xDFFF
They are simple to understand in the context of Unicode, atomic,
easy to store in strings, and easy to exchange with other languages
whose strings are Unicode-based.
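For concreteness, here is a rough sketch in Scheme; the predicates
code-point? and scalar-value? are names invented here, not taken from
any report:

    (define (code-point? n)
      (and (integer? n) (exact? n) (<= 0 n #x10FFFF)))

    (define (scalar-value? n)
      (and (code-point? n)
           (not (<= #xD800 n #xDFFF))))   ; exclude the surrogate range

    ;; (code-point? #xD800)     => #t
    ;; (scalar-value? #xD800)   => #f
    ;; (scalar-value? #x10FFFF) => #t
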
Thomas Lord <lord_at_emf.net> writes:
> *Allowing* unpaired surrogates does not *require* that
> unpaired surrogates be supported.
Permitting variation hinders portability: it leads to programs which
run correctly only on a subset of implementations.
> Tom> For every natural number (integers greater than or equal to 0)
> Tom> there exists a distinct CHAR value. The set of all such values
> Tom> are called "simple characters".
>
> John> Whatever for?
>
> So that the abstract model of character values is mathematically
> simple and so that it is a good model for communications generally.
It's indeed simpler, but it's worse for communication, not better,
because all the rest of the world uses Unicode within its limits.
> It keeps the communications model intact. An N-bit wide port,
> in this model, conveys characters 0..2^N-1.
The number of Unicode code points (#x110000) is not a power of 2, so
this model can express neither a code point port nor a Unicode scalar
value port.
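A quick arithmetic check of that claim:

    (expt 2 20)   ; => 1048576 = #x100000, too small
    #x110000      ; => 1114112 code points in 0..#x10FFFF
    (expt 2 21)   ; => 2097152 = #x200000, too large
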
The current computing world doesn't use N-bit wide ports for arbitrary
values of N. Almost all interchange formats are based on sequences of
bytes, and there is no obvious mapping between an N-bit port and a
byte port (there are several choices), so different pieces of software
using N-bit ports can't necessarily communicate directly with each
other, even for an agreed N.
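For example, a 21-bit value could be packed into three bytes
most-significant-first or least-significant-first, among other
schemes; the helpers below are hypothetical, only meant to show that
the choices exist and are incompatible:

    ;; Two of the several possible mappings from a 21-bit value to bytes.
    (define (pack-21-msb-first n)
      (list (quotient n 65536)
            (modulo (quotient n 256) 256)
            (modulo n 256)))

    (define (pack-21-lsb-first n)
      (reverse (pack-21-msb-first n)))

    ;; (pack-21-msb-first #x10FFFF) => (16 255 255)
    ;; (pack-21-lsb-first #x10FFFF) => (255 255 16)
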
> You shorted yourself, then, by not getting to the topic of combining
> sequence characters. Remember that, in addition to a simple
> character for every integer, I'm also suggesting that all tuples
> of simple characters are, themselves characters.
This requires an API to look inside characters; char->integer is no
longer sufficient.
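To make the cost concrete, a hedged sketch of the extra API surface
such compound characters would need; the representation (a non-empty
list of naturals) and all the names are invented for illustration:

    (define (make-compound-char components) components)
    (define (char-components c) c)           ; new accessor: look inside a char
    (define (simple-char? c) (= (length c) 1))

    ;; char->integer could only apply to simple characters:
    ;; (char-components (make-compound-char (list #x65 #x301)))
    ;;   => (101 769)   ; "e" followed by COMBINING ACUTE ACCENT
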
This complicates exchange with the rest of the world, because almost
all Unicode-based strings in other languages consist of atomic units:
either Unicode scalar values, code points, or code units of some
encoding form (UTF-8/16/32).
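For instance, assuming R6RS-style bytevector procedures are available,
the same two-character string looks like this through two of those
unit types:

    (map char->integer (string->list "aé"))     ; => (97 233), scalar values
    (bytevector->u8-list (string->utf8 "aé"))   ; => (97 195 169), UTF-8 code units
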
I would design this differently: don't use a separate character type
at all, and identify characters with strings of length 1. But that
would break Scheme tradition (which goes back to Lisp tradition), so
I'm not proposing it for Scheme.
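As a sketch only (again, not a proposal): accessors would return
strings of length 1, e.g. a hypothetical string-ref* in place of
string-ref:

    (define (string-ref* s i)
      (substring s i (+ i 1)))    ; returns a string of length 1, not a char

    ;; (string-ref* "abc" 1) => "b"
    ;; A combining sequence is then just a slightly longer string; no
    ;; separate character machinery is needed to represent it.
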
--
__("< Marcin Kowalczyk
\__/ qrczak_at_knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/