[r6rs-discuss] Strings

From: Jason Orendorff <jason.orendorff>
Date: Fri Mar 23 12:35:21 2007

On 3/22/07, Alexander Kjeldaas <alexander.kjeldaas_at_gmail.com> wrote:
> Python is *definitively* not utf16. Python can be compiled to use
> utf8, utf16 or utf32/ucs4.

UTF-16 or UTF-32. Not UTF-8.

I'll ask around and see if the Python folks think this has been good,
bad, or indifferent. My impression was that it's considered to have
been a mistake, but I could be wrong.

My thoughts on this topic actually come largely from Python's
experience in this arena.

> Python does not have a character type,
> avoiding the issue of whether there should be O(1) access to
> characters.

Um, this is a misunderstanding of how Python works. Python
provides O(1) access to code units, so for example on a
"ucs2" build (the default):

>>> s = u'\U00012345'
>>> len(s)
  2
>>> s[0]
  u'\ud808'

On a "ucs4" build the same code gives different answers. No one
exactly likes this in the Python camp, and I don't think we want this
for Scheme. If R6RS exposes code units, it should either
standardize on a representation everyone can live with, or
set the code unit API aside in a separate library, maybe
(r6rs string-code-units), so people won't naively trip over it.
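
To make the code unit versus code point distinction concrete without
depending on how the interpreter was built, here is a minimal sketch.
It assumes Python 3.3 or later, where len() counts code points; the
helper name utf16_code_units is made up for this example and is not
part of any library.

```python
import struct

def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of ints.

    Hypothetical helper: encodes to little-endian UTF-16 (no BOM) and
    unpacks the bytes as unsigned 16-bit values.
    """
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

s = "\U00012345"
units = utf16_code_units(s)

print(len(s))                   # number of code points
print(len(units))               # number of UTF-16 code units
print([hex(u) for u in units])  # the surrogate pair
```

The first code unit is the same 0xd808 that s[0] returns on the "ucs2"
build shown above.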

> According to Guido van Rossum, python 3000 might use all
> three internal representations at the same time.

Well, it's possible. I think he mentioned it at PyCon. I'll gladly
bet it doesn't change: too much work, and it would either complicate
the Python C API (one of Python's major strings--er, strengths) or
hurt performance, or both.

I'll ask about this too.

> Neither Xerces-C nor ICU specifies their internal representation as
> part of the interface AFAIK. On the other hand, since they deal with
> encodings they support lots of them.

Xerces-C:

  "String is represented by 'XMLCh*' which is a pointer to unsigned
  16 bit type holding utf-16 values, null terminated."

  http://xml.apache.org/xerces-c/ApacheDOMC++BindingL2.html

ICU:

  "In ICU, a Unicode string consists of 16-bit Unicode code units.
  A Unicode character may be stored with either one code unit
  (the most common case) or with a matched pair of special
  code units ("surrogates"). The data type for code units is UChar.
  [...]

  "Indexes and offsets into and lengths of strings always count
  code units, not code points."

  http://www.icu-project.org/apiref/icu4c/classUnicodeString.html#_details
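
What "indexes count code units, not code points" means for a caller
can be sketched as below. This is only an illustration in Python, not
ICU's API: the function name code_point_index is an assumption made up
for the example. It maps a UTF-16 code unit offset to the index of the
code point that contains it, and it has to scan from the start to do so.

```python
def code_point_index(s, unit_offset):
    """Map a UTF-16 code unit offset to a code point index in s.

    Hypothetical helper. Code points above U+FFFF occupy two code
    units (a surrogate pair); everything else occupies one.
    """
    units = 0
    for index, ch in enumerate(s):
        width = 2 if ord(ch) > 0xFFFF else 1
        if unit_offset < units + width:
            return index
        units += width
    if unit_offset == units:
        return len(s)  # offset just past the end of the string
    raise IndexError("code unit offset out of range")

s = "a\U00012345b"
# Code units: 'a' at 0, the surrogate pair at 1 and 2, 'b' at 3.
print([code_point_index(s, i) for i in range(5)])
```

The linear scan is the cost of presenting code point indexes on top of
a UTF-16 representation; indexing by code unit avoids it.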

Regarding the rest of your comments: your experience and mine
obviously differ. I wonder if you have profiled a system using both
UTF-16 and UTF-32 strings. I have not.

I think the rate-determining step is probably neither unaligned
accesses nor processor cache but how much copying and transcoding
you're forced to do. UTF-16 is a significant win in that regard.
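
As a rough sketch of the copying in question: if strings are held as
UTF-32 and a library such as the ones quoted above wants UTF-16, every
call across that boundary needs a freshly transcoded buffer. The
function name below is hypothetical, and the example only shows the
extra copy and the buffer sizes; it says nothing about measured
performance.

```python
def transcode_for_utf16_api(utf32_bytes):
    """Produce a new UTF-16 buffer from a UTF-32 one.

    Hypothetical helper: decodes little-endian UTF-32 and re-encodes
    as little-endian UTF-16, which allocates a second buffer.
    """
    return utf32_bytes.decode("utf-32-le").encode("utf-16-le")

text = "abc\U00012345"
utf32 = text.encode("utf-32-le")        # 4 bytes per code point
utf16 = transcode_for_utf16_api(utf32)  # 2 or 4 bytes per code point

print(len(utf32), len(utf16))
```

A UTF-16 string could be handed to such a library as-is, with no
second buffer.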

-j
Received on Fri Mar 23 2007 - 12:34:57 UTC
