[r6rs-discuss] Re: [Formal] formal comment (ports, characters, strings, Unicode)
Not bucky bits, but one potential example of utilizing
more bits per character.
If I'm writing an editor for multilingual text, I need
to attach language attributes to regions of the buffer,
since the same character may have to be rendered
differently depending on the language. (For CJK unified
characters, glyphs differ between languages; using fonts
created for another language makes them look very weird.
And I want one level of indirection, e.g. specifying a
language and then selecting fonts for that language, instead
of directly associating fonts with each region.)
Suppose I want to use Scheme as the extension language of
the editor. It will have an operation to extract a region
of the buffer as a Scheme string. And it will be useful
if the extracted string contains language information as
well, for I might want to do language-specific operations.
Using 32 bits per character and putting auxiliary language info
into the top 11 bits would be a plausible implementation.
(At least it looks better than using the "strongly discouraged"
Unicode language tag characters.) R6RS string/character
operations may just ignore those aux bits, which is fine,
but I'd like the standard to allow me to add extensions that
deal with them. I suspect that locking into UTF-16 prohibits
such extensions.
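
To make the bit layout concrete, here is a minimal sketch in Python of one way such a cell could be laid out. It assumes the low 21 bits hold the Unicode code point (enough for U+0000 through U+10FFFF) and the top 11 bits hold a language tag. The tag numbering (LANG_JA, LANG_ZH) and all function names are hypothetical, made up for illustration only; nothing here is part of R6RS or of any existing implementation.

```python
# Sketch only: one possible layout for a 32-bit character cell.
# Low 21 bits: Unicode code point. Top 11 bits: auxiliary language tag.
CODE_POINT_BITS = 21
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1
LANG_BITS = 11
MAX_LANG_TAG = (1 << LANG_BITS) - 1

# Hypothetical tag numbering; 0 means "no language information".
LANG_NONE, LANG_JA, LANG_ZH = 0, 1, 2


def pack(ch, lang=LANG_NONE):
    """Combine one character and a language tag into a 32-bit cell."""
    if not 0 <= lang <= MAX_LANG_TAG:
        raise ValueError("language tag does not fit in 11 bits")
    return (lang << CODE_POINT_BITS) | ord(ch)


def code_point(cell):
    """Return the code point, ignoring the auxiliary bits."""
    return cell & CODE_POINT_MASK


def language(cell):
    """Return the language tag stored in the top 11 bits."""
    return cell >> CODE_POINT_BITS


def plain_text(cells):
    """What a standard string operation would see: the tags dropped."""
    return "".join(chr(code_point(c)) for c in cells)


def tag_region(text, lang):
    """Tag every character of an extracted region with one language."""
    return [pack(ch, lang) for ch in text]
```

With this layout, the same unified Han character tagged as Japanese and as Chinese gives two different cells that still compare equal once the aux bits are masked off, which is the "standard operations just ignore those bits" behaviour described above.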
(Of course, in the editor buffer, things get more complicated
because of combining characters, but I'm thinking of the case
where I extract a part of it into Scheme strings.)
(I think Emacs distinguishes characters of different languages
by adding a leading octet unique to each language. With that it
can properly distinguish Japanese and Chinese characters
even if they are mixed in a single document. I doubt most
Unicode plain-text editors can do that.)
--shiro
Received on Mon Mar 26 2007 - 05:56:21 UTC