[r6rs-discuss] Strings

From: Alexander Kjeldaas <alexander.kjeldaas>
Date: Thu Mar 22 13:03:49 2007

On 3/21/07, MichaelL_at_frogware.com <MichaelL_at_frogware.com> wrote:
> Jason Orendorff wrote:
>
> > And most (but not all) Unicode string implementations use UTF-16.
> > Among languages and libraries that are very widely used, the majority
> > is overwhelming: Java, Microsoft's CLR, Python, JavaScript, Qt,
> > Xerces-C, and on and on.
>
> (...and Windows and Mac and IBM's ICU and PHP 6 and...)
>

They don't use UTF-16 by choice.

Most of the systems that *do* use UTF-16 started out using UCS-2.
Only after Unicode was extended beyond the 16-bit range did they
change to UTF-16.
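
To illustrate what the move from UCS-2 to UTF-16 involves, here is a
minimal sketch in Python (chosen only for illustration; the function
name to_surrogate_pair is hypothetical). It shows the standard UTF-16
arithmetic that splits a code point above U+FFFF into two 16-bit code
units, which is exactly the case UCS-2 could not represent.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))  # 0xd834 0xdd1e
```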

To me it is an axiom that any system that defines a *character* to be
16 bits wide and wants to support Unicode natively is broken (and
nobody is proposing that, I know).

But still systems have 16-bit char types. Why?

Of course systems that were designed at a time when Unicode was 16-bit
would choose 16-bit Unicode code points as characters if they needed
to define the representation for characters.

And probably because of exactly the issues being discussed in these
threads (O(1) access to strings etc.), these languages _defined_
characters to be 16-bit and are now stuck. Basically their only
choice is to upgrade their representation to UTF-16 and add a lot of
complexity to deal with the issues. In some cases they will choose to
remain broken.

An example is IsNumber(char) in the CLR. A single 16-bit char cannot
hold a character outside the 16-bit range, so you need
IsNumber(string, int) to check for a number at a particular index in
the string, because the first function was broken.
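
The same problem can be reproduced outside the CLR. The sketch below
is Python, not the CLR API, and the helper name utf16_units is
hypothetical. It takes a digit that lies outside the 16-bit range,
U+1D7D8 MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO, and compares
classifying the whole character against classifying each 16-bit code
unit on its own.

```python
import unicodedata


def utf16_units(text):
    """Return the 16-bit code units of text in UTF-16 (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


digit = "\U0001D7D8"  # a decimal digit above U+FFFF

# Classifying the whole character gives the expected answer.
whole = unicodedata.category(digit)

# Classifying one 16-bit unit at a time only ever sees surrogates.
per_unit = [unicodedata.category(chr(unit)) for unit in utf16_units(digit)]

print(whole)     # Nd
print(per_unit)  # ['Cs', 'Cs']
```

A predicate that receives one 16-bit value has no way to answer
correctly here, which is why a (string, index) form is needed.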


For JavaScript, there is a proposal to fix the 16-bit Unicode issue in
ECMAScript 4:
http://developer.mozilla.org/es4/proposals/update_unicode.html

Python is *definitely* not UTF-16. Python can be compiled to use
UTF-8, UTF-16 or UTF-32/UCS-4. Python does not have a character type,
avoiding the issue of whether there should be O(1) access to
characters. According to Guido van Rossum, Python 3000 might use all
three internal representations at the same time.

Qt can be compiled for various internal representations.

Neither Xerces-C nor ICU specifies its internal representation as
part of the interface AFAIK. On the other hand, since they deal with
encodings, they support lots of them.


The only thing that is good about UTF-16 is that it makes interfacing
with *certain* other languages easy. However, it does not make it easy
to do I/O, and for a sufficiently large system, you will have to
interface with systems using at least Latin-1, UTF-8 and UTF-16. Looking
forward, I expect that you will have to interface with systems using
UTF-32 as well. The ICU docs say that they use UTF-16 because it is
faster than UTF-32. My experience is that using UTF-32 is superior to
UTF-16.

If you have a large set of strings in your application and need to
save memory, you want to convert your strings to an encoding that
saves memory, i.e. UTF-8 in most cases. If you are interested in
simplicity, you want O(1) access to characters, i.e. UTF-32. If you
want speed and complexity, you want UTF-8, UTF-16 or UTF-32 depending
on your character distribution.
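
How much the character distribution matters can be checked directly.
The following Python sketch (the function name encoded_sizes is
hypothetical) measures the byte size of the same text in each
encoding, using the byte-order-specific codec names so that no BOM
is counted.

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies in each encoding."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("hello"))       # ASCII only
# {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}

print(encoded_sizes("日本語"))       # three characters in the 16-bit range
# {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12}

print(encoded_sizes("\U0001D11E"))  # one character above U+FFFF
# {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

For ASCII-heavy text UTF-8 is the smallest, for text like the second
sample UTF-16 is, and UTF-32 is never smaller than the other two.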

My experience is that in the 99% of cases where you have a reasonable
number of strings in memory, UTF-32 is what you want. In the 1% of
cases where memory matters, UTF-8 or UTF-16 is what you want, but it
depends on the application.

For I/O you need to be able to specify the encoding. Thus to me the
simple and obvious solution, if you wanted to start from scratch, is
to have strings as an array of 21-bit characters, and to support
encoding when you want to talk to the rest of the world. This is what
SBCL does. Some implementations don't start from scratch: they want
to run on top of the JVM, the CLR, etc., so they need to be able to
use UTF-16. But that does not invalidate the fact that UTF-16 would
not be a good choice if you had a choice.
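
As a rough illustration of that design (a sketch in Python, not how
SBCL or any Scheme implements it; both function names are
hypothetical), the in-memory form below is a plain list of code
points, and an encoding is named only at the boundary where bytes
come in or go out.

```python
def decode_to_code_points(raw, encoding):
    """Input boundary: bytes in a named encoding become code points."""
    return [ord(ch) for ch in raw.decode(encoding)]


def encode_from_code_points(code_points, encoding):
    """Output boundary: code points become bytes in a named encoding."""
    return "".join(chr(cp) for cp in code_points).encode(encoding)


# Read Latin-1 input, work on code points, write UTF-8 output.
code_points = decode_to_code_points(b"caf\xe9", "latin-1")
print(code_points)                                    # [99, 97, 102, 233]
print(code_points[3] == 0xE9)                         # O(1) indexing: True
print(encode_from_code_points(code_points, "utf-8"))  # b'caf\xc3\xa9'

# Every Unicode code point fits in 21 bits.
print(0x10FFFF < 2 ** 21)                             # True
```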

Basic support:
- Specify O(1) access to code units.
- Specify an iterator interface for strings.
- Support encodings in the i/o interfaces or conversion functions.

Extended support:
- Also specify O(1) access to characters. This requires UTF-32 or
similar as the internal representation.
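
The two levels above can be sketched as follows. This is Python
rather than Scheme, and the class Utf16String and its method names
are hypothetical, not a proposed R6RS interface. The class stores
UTF-16 code units, gives O(1) access to a code unit, and exposes
characters only through an iterator. With a 32-bit representation the
iterator would be unnecessary, since indexing a code unit and
indexing a character would be the same operation.

```python
class Utf16String:
    """A string stored as 16-bit code units (illustrative only)."""

    def __init__(self, text):
        data = text.encode("utf-16-le")
        self.units = [int.from_bytes(data[i:i + 2], "little")
                      for i in range(0, len(data), 2)]

    def code_unit(self, index):
        """Basic support: O(1) access to a code unit."""
        return self.units[index]

    def code_points(self):
        """Basic support: iterate over characters, joining surrogate pairs."""
        i, n = 0, len(self.units)
        while i < n:
            unit = self.units[i]
            is_high = 0xD800 <= unit <= 0xDBFF
            if is_high and i + 1 < n and 0xDC00 <= self.units[i + 1] <= 0xDFFF:
                low = self.units[i + 1]
                yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 2
            else:
                yield unit  # BMP code point (or an unpaired surrogate)
                i += 1


s = Utf16String("a\U0001D11Eb")
print(len(s.units))                        # 4
print(hex(s.code_unit(1)))                 # 0xd834
print([hex(cp) for cp in s.code_points()]) # ['0x61', '0x1d11e', '0x62']
```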

> On a different note, I find this desire to shield programmers from code
> units odd and senseless. If R6RS intends Scheme to be a higher-level
> language that abstracts away representation issues why is it adding
> fixnums and flonums? Why do bytevectors have operations that get and set
> singles and doubles?
>

I agree.

Alexander
Received on Thu Mar 22 2007 - 13:03:36 UTC
