On 3/22/07, Alexander Kjeldaas <alexander.kjeldaas_at_gmail.com> wrote:
> Python is *definitively* not utf16. Python can be compiled to use
> utf8, utf16 or utf32/ucs4.
UTF-16 or UTF-32. Not UTF-8.
I'll ask around and see if the Python folks think this has been good,
bad, or indifferent. My impression was that it's considered to have
been a mistake, but I could be wrong.
My thoughts on this topic actually come largely from Python's
experience in this arena.
> Python does not have a character type,
> avoiding the issue of whether there should be O(1) access to
> characters.
Um, this is a misunderstanding of how Python works. Python
provides O(1) access to code units, so for example on a
"ucs2" build (the default):
>>> s = u'\U00012345'   # one character, outside the BMP
>>> len(s)              # counts code units, not characters
2
>>> s[0]                # the high surrogate, half of the pair
u'\ud808'
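On a "ucs4" (wide) build, where every code point fits in a single
code unit, I'd expect the same session to print:
>>> s = u'\U00012345'
>>> len(s)
1
>>> s[0]
u'\U00012345'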
On a "ucs4" build the same code gives different answers. No one
exactly likes this in the Python camp, and I don't think we want this
for Scheme. If R6RS exposes code units, it should either
standardize on a representation everyone can live with; or
set the code unit API aside in a separate library, maybe
(r6rs string-code-units), so people won't naively trip over it.
> According to Guido van Rossum, python 3000 might use all
> three internal representations at the same time.
Well, it's possible. I think he mentioned it at PyCon. I'll gladly
bet it doesn't change: too much work, and it would either complicate
the Python C API (one of Python's major strings--er, strengths) or
hurt performance, or both.
I'll ask about this too.
> Neither Xerces-C nor ICU specifies their internal representation as
> part of the interface AFAIK. On the other hand, since they deal
> with encodings they support lots of them.
Xerces-C:
"String is represented by 'XMLCh*' which is a pointer to unsigned
16 bit type holding utf-16 values, null terminated."
http://xml.apache.org/xerces-c/ApacheDOMC++BindingL2.html
ICU:
"In ICU, a Unicode string consists of 16-bit Unicode code units.
A Unicode character may be stored with either one code unit
(the most common case) or with a matched pair of special
code units ("surrogates"). The data type for code units is UChar.
[...]
"Indexes and offsets into and lengths of strings always count
code units, not code points."
http://www.icu-project.org/apiref/icu4c/classUnicodeString.html#_details
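This is also why every UTF-16 API grows a code-point counter
like ICU's countChar32(). A rough sketch of that loop for a
narrow Python build (count_code_points is my name for it, not
an existing API):

def count_code_points(s):
    # Walk the string code unit by code unit: a high surrogate
    # followed by a low surrogate is one code point; anything
    # else, including a lone surrogate, counts by itself.
    n = 0
    i = 0
    while i < len(s):
        is_pair = (u'\ud800' <= s[i] <= u'\udbff'
                   and i + 1 < len(s)
                   and u'\udc00' <= s[i + 1] <= u'\udfff')
        i += 2 if is_pair else 1
        n += 1
    return n

On a narrow build, count_code_points(u'\U00012345') is 1 even
though len() says 2.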
Regarding the rest of your comments: your experience and mine
obviously differ. I wonder if you have profiled a system using
both UTF-16 and UTF-32 strings. I have not.

I think the rate-determining step is probably neither unaligned
accesses nor processor cache effects but how much copying and
transcoding you're forced to do. UTF-16 is a significant win in
that regard: when the libraries you call (like the two quoted
above) already traffic in UTF-16, strings cross that boundary
without being transcoded at all.
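The copying half is easy to estimate. A back-of-the-envelope
sketch (it assumes a Python new enough to ship the utf-32
codec, and it measures only bytes, not codec speed):
>>> s = u'x' * 1000000          # a million BMP characters
>>> len(s.encode('utf-16-le'))  # 2 bytes per BMP character
2000000
>>> len(s.encode('utf-32-le'))  # 4 bytes per character, always
4000000
Every copy of that string moves twice as many bytes in UTF-32,
before you ever transcode anything.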
-j