[r6rs-discuss] Strings as codepoint-vectors: bad

From: Jason Orendorff <jason.orendorff>
Date: Thu Mar 15 11:13:35 2007

I think people who favor strings-as-codepoint-vectors must think that the
codepoint is a good level of abstraction for text. Really it's not.

   One or more Unicode characters may make up what the user thinks of
   as a character or basic unit of the language. To avoid ambiguity
   with the computer use of the term character, this is called a
   grapheme cluster. For example, `G' + acute-accent is a grapheme
   cluster: it is thought of as a single character by users, yet is
   actually represented by two Unicode code points.

   -- Unicode Standard Annex #29
   http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
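
To see the distinction in code, here is a minimal sketch in Java
(whose strings come up below; the rendering comment assumes a
terminal that composes combining marks):

    // 'G' + U+0301 COMBINING ACUTE ACCENT: one user-perceived character,
    // but two code points, stored as two UTF-16 code units.
    public class GraphemeDemo {
        public static void main(String[] args) {
            String s = "G\u0301";
            System.out.println(s);                                // one accented G
            System.out.println(s.length());                       // 2 code units
            System.out.println(s.codePointCount(0, s.length()));  // 2 code points
        }
    }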

In Java, C#, and in all likelihood Python 3.0, strings are immutable
sequences of 16-bit values (UTF-16 code units). Surrogate pairs get no
special treatment: length and indexing count code units, and a
supplementary character simply occupies two of them.

This is a good design.
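
Concretely, here is a small sketch of what that design means in Java
(any supplementary character would do; U+1D11E MUSICAL SYMBOL G CLEF
is just an example):

    // A supplementary character is stored as a surrogate pair: two
    // 16-bit code units. length() and charAt() count code units and
    // hand back the surrogate halves without complaint.
    public class SurrogateDemo {
        public static void main(String[] args) {
            String clef = new String(Character.toChars(0x1D11E));
            System.out.println(clef.length());                        // 2
            System.out.println(Integer.toHexString(clef.charAt(0)));  // d834
            System.out.println(Integer.toHexString(clef.charAt(1)));  // dd1e
        }
    }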

Treating a string as a sequence of Unicode codepoints has few
real-world use cases. For ordinary text-munging, we use higher-level
functions such as (string-append), (string-find), (string-replace),
(string-starts-with?), and so on. In other words, the objects we want
to use when working with strings are... substrings. Note that all
these useful functions can be implemented "naively" in terms of UTF-16
code units and they'll work just fine, even on surrogate pairs.
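
The reason the naive implementations are safe is that UTF-16 is
self-synchronizing: the code units 0xD800 through 0xDFFF are reserved
for surrogates, so neither half of a pair can ever be mistaken for an
ordinary one-unit character, and the lead and trail ranges don't
overlap each other. A small sketch using Java's indexOf and
startsWith, which match code-unit-wise in exactly this naive way:

    // Naive code-unit search gives correct answers even across a
    // surrogate pair (here U+1D11E, which encodes as two code units).
    public class NaiveSearchDemo {
        public static void main(String[] args) {
            String clef = new String(Character.toChars(0x1D11E));
            String s = "ab" + clef + "cd";
            System.out.println(s.indexOf(clef));     // 2 (a code-unit offset)
            System.out.println(s.indexOf("c"));      // 4, never inside the pair
            System.out.println(s.startsWith("ab"));  // true
        }
    }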

The only use cases I know of for codepoint sequences are to implement
Unicode algorithms, like laying out bidirectional text. Here UTF-16
is no real burden compared to the sheer complexity of the task at
hand. (See http://unicode.org/reports/tr9/ for example.)

By contrast, passing a UTF-16 string to some external function is an
extremely common and important use case. It's especially important on
Windows and for anything that targets the JVM or CLR.

I think people who favor strings-as-codepoint-vectors must also think
that breaking a surrogate pair is really bad. But even with a
codepoint-centric view of text you can unwittingly break a grapheme
cluster, which is the same sort of bug (it can lead to garbled text)
and is probably much *more* common in practice. I never hear anyone
complain about that.
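
Here is that bug as a small Java sketch: no surrogate pair is touched,
yet a naive split garbles the text by stranding a combining mark.

    // "cafe" + U+0301 COMBINING ACUTE ACCENT + "!": splitting at index 4
    // separates the base 'e' from its accent. Every code unit here is a
    // whole code point, so a codepoint-vector view of the string would
    // not have prevented this.
    public class BrokenGraphemeDemo {
        public static void main(String[] args) {
            String s = "cafe\u0301!";
            String head = s.substring(0, 4);        // "cafe" (accent lost)
            String tail = s.substring(4);           // orphaned accent + "!"
            System.out.println(head + "X" + tail);  // accent lands on 'X'
        }
    }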

Making strings vectors of 16-bit values is simple, familiar,
speed-efficient, memory-efficient, easy to implement, and convenient
for programmers.

-j