[r6rs-discuss] Strings as codepoint-vectors: bad

From: Jason Orendorff <jason.orendorff>
Date: Fri Mar 16 01:37:22 2007

John Cowan wrote:
> Jason Orendorff scripsit:
> > I think people who favor strings-as-codepoint-vectors must also think
> > that breaking a surrogate pair is really bad. But even with a
> > codepoint-centric view of text you can unwittingly break a grapheme
> > cluster, which amounts to the same sort of bug--it can lead to garbled
> > text--and which is probably much *more* common in practice. I never
> > hear anyone complain about that.
>
> I absolutely disagree that these two problems are analogous at all:

I guess we just have to disagree. Both cases involve a character
being botched because software broke the data at an inappropriate
boundary. To me, they're not just analogous; they're practically
identical. I'm trying to imagine how I would explain the distinction
to my wife. Drawing a blank here.

> Separating surrogate pairs is (a) UTF-16 specific and (b) leaves the
> result uninterpretable. Gumming up a grapheme cluster is more like
> an off-by-one error in inserting a character: the output is garbled
> but not garbage.

Most systems recover from the former error by losing the one broken
character (some systems replace it with '?'; some render a blank box)
and interpreting everything else just fine. I don't know what you
mean by "uninterpretable".

Most systems recover from the latter error by silently discarding the
orphaned combining marks.
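
To make the symmetry concrete, here is a minimal sketch of both failure modes in Python 3 (the specific strings are illustrative, not from the original thread): slicing between a base letter and its combining mark garbles a grapheme cluster, and slicing a UTF-16 byte sequence between the halves of a surrogate pair leaves an uninterpretable fragment that decoders replace with U+FFFD.

```python
# Grapheme-cluster break: "e" followed by U+0301 COMBINING ACUTE ACCENT.
s = "cafe\u0301"              # renders as "café" (5 codepoints)
head, tail = s[:4], s[4:]     # split between the base letter and its accent
# head == "cafe", tail is the orphaned combining mark "\u0301";
# software processing the halves separately typically drops or
# misattaches the mark -- garbled, but still interpretable text.

# Surrogate-pair break: an astral codepoint is two UTF-16 code units.
t = "\U0001F600"                      # one codepoint outside the BMP
units = t.encode("utf-16-le")         # 4 bytes: high + low surrogate
fragment = units[:2]                  # keep only the high surrogate
recovered = fragment.decode("utf-16-le", errors="replace")
# recovered == "\ufffd" -- the lone surrogate is replaced, and
# everything around it would decode just fine.
```

In both runs, exactly one character is botched because the data was cut at an inappropriate boundary; the rest of the text survives.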

(shrug) I don't see how the first one is more annoying than slow
software, while the second one is negligible--especially given that
surrogate pairs are extremely rare in practice (few people's names
contain Byzantine musical symbols or Kharoṣṭhī letters) compared
to, you know, accents.

-j
Received on Fri Mar 16 2007 - 01:37:10 UTC
