[r6rs-discuss] unicode (re comment #134)

From: Shiro Kawai <shiro>
Date: Sun Dec 17 19:42:25 2006

I agree with Tom Lord that an implementation might not make char the
same as Unicode code points. As one concrete example, Gauche
can be compiled to use JISX0213 as its "native" character set,
which includes characters that are not part of Unicode.

However, if integer->char is supposed to work portably, I think
it is reasonable to restrict its domain to Unicode scalar values.
There's no way to put JISX0213-specific characters into it anyway.
Instead, I'd make an implementation-specific routine, something
like native-codepoint->char.
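
[Editorial note: the domain restriction proposed above can be sketched as
follows. This is an illustrative sketch in Python, not Scheme;
integer_to_char stands in for integer->char, and native-codepoint->char
is Shiro's hypothetical name for an implementation-specific routine, not
an existing Gauche API.]

```python
# Sketch of the proposal: a portable integer->char whose domain is
# limited to Unicode scalar values (0..#x10FFFF minus the surrogate
# range #xD800..#xDFFF). Illustrative only; not a real Scheme API.

def is_unicode_scalar_value(n: int) -> bool:
    """True iff n is a Unicode scalar value."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)

def integer_to_char(n: int) -> str:
    """Portable integer->char: rejects anything outside the domain."""
    if not is_unicode_scalar_value(n):
        raise ValueError(f"not a Unicode scalar value: {hex(n)}")
    return chr(n)

print(integer_to_char(0x41))  # A
```

A JISX0213-specific character would instead go through the
implementation-specific routine, which is free to accept native code
points outside this domain.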

BTW, if we can depart from backward compatibility here, I'd prefer
a more specific name such as ucs->char instead of integer->char.
Then there's less room for discussion about what it should do, even
if the implementation uses a non-Unicode character set or encoding.

--shiro


From: Thomas Lord <lord_at_emf.net>
Subject: Re: [r6rs-discuss] unicode (re comment #134)
Date: Sun, 17 Dec 2006 15:41:21 -0800

> Marcin 'Qrczak' Kowalczyk wrote:
> > IMHO the two best choices for the Unicode-based notion of a character
> > in a programming language are:
> >
> > * code points: 0..#x10FFFF
> > * Unicode scalar values: 0..#x10FFFF excluding #xD800..#xDFFF
> >
> > They are simple to understand in the context of Unicode, atomic,
> > easy to store in strings, and easy to exchange with other languages
> > with Unicode-based strings.
> >
>
> I agree, more or less. I would add "a superset of code points"
> to the mix.
>
> The issue raised in comment #134 is that, as written, the draft
> *requires* that char be "unicode scalar value". An implementation
> may not make char the same as code points, or a superset of
> code points (at least if integer->char is to behave in the expected
> way).
>
> -t
>
> > Thomas Lord <lord_at_emf.net> writes:
> >
> >
> >> *Allowing* unpaired surrogates does not *require* that
> >> unpaired surrogates be supported.
> >>
> >
> > Permitting variation hinders portability; it leads to programs which
> > run correctly only on a subset of implementations.
> >
> >
> >> Tom> For every natural number (integer greater than or equal to 0)
> >> Tom> there exists a distinct CHAR value. The set of all such values
> >> Tom> is called "simple characters".
> >>
> >> John> Whatever for?
> >>
> >> So that the abstract model of character values is mathematically
> >> simple and so that it is a good model for communications generally.
> >>
> >
> > It's indeed simpler, but it's worse for communication, not better,
> > because all the rest of the world uses Unicode within its limits.
> >
> >
> >> It keeps the communications model intact. An N-bit wide port,
> >> in this model, conveys characters 0..2^N-1.
> >>
> >
> > The range of Unicode code points is not a power of 2, so this can
> > express neither a code-point port nor a Unicode scalar value port.
> >
> > The current computing world doesn't use N-bit wide ports for arbitrary
> > values of N. Almost all interchange formats are based on sequences of
> > bytes, and there is no obvious mapping between an N-bit port and a
> > byte port (there are several choices), so different pieces of software
> > using N-bit ports can't necessarily communicate directly with each
> > other, even for an agreed N.
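
[Editorial note: the ambiguity described in the quoted paragraph can be
made concrete. Below is a hedged sketch in Python of two equally
plausible byte-level framings of a 21-bit port; both functions are
illustrative inventions, not anything proposed in the thread.]

```python
# Two of the several possible mappings from an N-bit port (N = 21) to a
# byte port. Software that agrees on N but picks different mappings
# cannot interoperate, which is the quoted point.

def pad_to_bytes(values):
    """Each 21-bit value in its own 4 bytes, big-endian (UTF-32-style)."""
    return b"".join(v.to_bytes(4, "big") for v in values)

def bit_pack(values, n_bits=21):
    """Concatenate raw 21-bit fields, then zero-pad the tail to a byte."""
    bits = "".join(format(v, f"0{n_bits}b") for v in values)
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

vals = [0x41, 0x3042]
print(pad_to_bytes(vals).hex())  # 0000004100003042
print(bit_pack(vals).hex())      # 6 bytes, an entirely different framing
```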
> >
> >
> >> You shorted yourself, then, by not getting to the topic of combining
> >> sequence characters. Remember that, in addition to a simple
> >> character for every integer, I'm also suggesting that all tuples
> >> of simple characters are, themselves, characters.
> >>
> >
> > This requires an API to look inside characters; char->integer is no
> > longer sufficient.
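
[Editorial note: to see why char->integer stops being sufficient,
consider a composite character built from a base letter and a combining
accent. A minimal Python sketch follows; char_to_integers is a
hypothetical decomposition API invented here, not a proposal from the
thread.]

```python
# If a "character" may be a tuple of simple characters (e.g. a base
# letter plus a combining mark), a single integer cannot describe it;
# an API returning the component scalar values is needed instead.

def char_to_integers(c: str) -> list[int]:
    """Hypothetical char->integers: the scalar values inside a char."""
    return [ord(u) for u in c]

composite = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(char_to_integers(composite))  # [101, 769]
```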
> >
> > This complicates exchange with the rest of the world, because almost
> > all Unicode-based strings in various languages consist of some atomic
> > units: either Unicode scalar values, or code points, or code units of
> > some encoding form (UTF-8/16/32).
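
[Editorial note: the point about differing atomic units can be checked
directly — one scalar value becomes a different number of code units in
each encoding form. A small Python illustration:]

```python
# One supplementary-plane scalar value, three code-unit sequences.
cp = 0x1F600
s = chr(cp)

utf8 = s.encode("utf-8")       # four 1-byte code units
utf16 = s.encode("utf-16-be")  # two 2-byte code units (surrogate pair)
utf32 = s.encode("utf-32-be")  # one 4-byte code unit

print(len(utf8), len(utf16) // 2, len(utf32) // 4)  # 4 2 1
```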
> >
> > I would design this differently: don't use a separate character type
> > at all, identify them with strings of length 1. But it would break
> > Scheme tradition (which goes back to Lisp tradition), so I'm not
> > proposing this for Scheme.
> >
> >
>
Received on Sun Dec 17 2006 - 19:42:42 UTC

This archive was generated by hypermail 2.3.0 : Wed Oct 23 2024 - 09:15:00 UTC