> If string-ref also required O(1) time complexity, then you'd be right.
> But it doesn't; it's perfectly fine to implement string-ref on top of
> underlying UTF-8 or UTF-16 character sequences; you just have to settle
> for O(N) performance.
Are you suggesting that indexes represent code points rather than code
units? I haven't seen anyone do that, not as the one-and-only interface to
elements of a string. Have you? And do you think UTF-8/UTF-16
implementations should be *required* to do that? (Obviously, then,
string-length would have to return the number of code points rather than
the number of code units.)
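To make the distinction concrete, here is a small sketch of how the three candidate "lengths" differ for one string. It assumes Python 3, whose str type is indexed by code point; the variable names are purely illustrative and nothing here is proposed as an interface.

```python
# Illustrative only: compare code-point and code-unit counts for one string.
# Assumes Python 3, where str is a sequence of code points.
s = "a\u00e9\U0001D11E"  # 'a', U+00E9 (e with acute), U+1D11E (musical G clef)

# Number of code points.
code_points = len(s)

# Number of UTF-8 code units (bytes): 1 + 2 + 4.
utf8_units = len(s.encode("utf-8"))

# Number of UTF-16 code units (16-bit words): 1 + 1 + 2 (surrogate pair).
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf8_units, utf16_units)
```

A string-length defined over code points would report the first number; one defined over code units would report the second or third, depending on the encoding.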
It's interesting that you bring up the point about O(1) complexity. That
is, of course, the assumption people currently make. If the assumption
isn't justified, that should probably be stated formally, since it would
affect the way people write string-processing algorithms.
Note: Perhaps a solution is to have two variants of the procedures, one
for code points and one for code units. The code-unit variants would
guarantee O(1) access and the code-point ones wouldn't.
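As a rough sketch of what such a pair of variants might look like over a UTF-8 representation (in Python rather than Scheme, and assuming well-formed UTF-8), consider the following. The class Utf8String and its method names are hypothetical, made up for illustration; they are not taken from any existing library or proposal.

```python
class Utf8String:
    """Hypothetical string type stored as UTF-8 bytes (illustration only)."""

    def __init__(self, text):
        self._buf = text.encode("utf-8")

    def code_unit_length(self):
        # O(1): the buffer knows its own size.
        return len(self._buf)

    def code_unit_ref(self, k):
        # O(1): direct index into the byte buffer; returns an integer 0..255.
        return self._buf[k]

    def code_point_length(self):
        # O(N): count every byte that is not a continuation byte (10xxxxxx).
        return sum(1 for b in self._buf if (b & 0xC0) != 0x80)

    def code_point_ref(self, k):
        # O(N): scan from the start, counting lead bytes until the k-th one.
        if k < 0:
            raise IndexError(k)
        seen = -1
        for start, b in enumerate(self._buf):
            if (b & 0xC0) != 0x80:
                seen += 1
                if seen == k:
                    # Extend over the continuation bytes of this code point.
                    end = start + 1
                    while end < len(self._buf) and (self._buf[end] & 0xC0) == 0x80:
                        end += 1
                    return self._buf[start:end].decode("utf-8")
        raise IndexError(k)
```

With this layout, an algorithm that loops over code_point_ref for every index becomes quadratic, which is exactly why the complexity guarantee would need to be stated.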
> Python has been suffering through that for several years now, and has
> decided to break backward compatibility and abandon the 8-bit strings --
> but using the 8-bit names for Unicode strings. I don't know what the
> internal implementation is.
John, I can't find any support for that, either in the developer
mailing list summaries at
http://www.python.org/dev/summary/ or in the
Python Enhancement Proposals (PEPs) at
http://www.python.org/dev/peps/.
Here are the Unicode-related ones that I *could* find:
Python Unicode Integration [Final]
  http://www.python.org/dev/peps/pep-0100/
Support for "wide" Unicode characters [Final]
  http://www.python.org/dev/peps/pep-0261/
Unicode file name support for Windows NT [Final]
  http://www.python.org/dev/peps/pep-0277/
Byte vectors and String/Unicode Unification [Rejected]
  http://www.python.org/dev/peps/pep-0332/
Allow str() to return unicode strings [Deferred]
  http://www.python.org/dev/peps/pep-0349/
Note that the last PEP, dated August 2005, references Python 2.5 and is
deferred. Here is what the Rationale text says:
    Python has had a Unicode string type for some time now but use of
    it is not yet widespread. There is a large amount of Python code
    that assumes that string data is represented as str instances.
    The long term plan for Python is to phase out the str type and use
    unicode for all string data. Clearly, a smooth migration path
    must be provided.
The PEP is old, but it's still deferred.
Received on Thu Mar 15 2007 - 11:12:54 UTC