[r6rs-discuss] Strings as codepoint-vectors: bad

From: Thomas Lord <lord>
Date: Fri Mar 16 02:25:48 2007

(Mostly just unpacking Cowan's points a bit.)

Jason Orendorff wrote:
> John Cowan wrote:
>> Jason Orendorff scripsit:
>> > I think people who favor strings-as-codepoint-vectors must also think
>> > that breaking a surrogate pair is really bad. But even with a
>> > codepoint-centric view of text you can unwittingly break a grapheme
>> > cluster, which amounts to the same sort of bug--it can lead to garbled
>> > text--and which is probably much *more* common in practice. I never
>> > hear anyone complain about that.
>>
>> I absolutely disagree that these two problems are analogous at all:
>
> I guess we just have to disagree. Both cases involve a character
> being botched because software broke the data at an inappropriate
> boundary. To me, they're not just analogous; they're practically
> identical. I'm trying to imagine how I would explain the distinction
> to my wife. Drawing a blank here.

You'd have to explain the tower of representations. You could
compare a single wrod with a typo in it to a
sentence or phrase out of with some words order -- then explain
the analogy to encoding and character composition.

As usual in this area, it is hard to sort out (for example) your and
Cowan's discussion because of unqualified uses of the word "character"
and not enough precision in distinguishing the layers of the
representation tower for a technical audience.
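
To make the layers concrete, here is a rough sketch in Python (my choice
of language for illustration, not anything from the R6RS drafts). The
helper names are hypothetical. It shows one accented letter and one
Byzantine musical symbol at three levels of the tower: code points,
UTF-16 code units, and UTF-8 bytes.

```python
import struct


def code_points(text):
    """Return the Unicode code points of text as integers."""
    return [ord(ch) for ch in text]


def utf16_code_units(text):
    """Return the 16-bit code units of text's UTF-16 encoding."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


def utf8_bytes(text):
    """Return the bytes of text's UTF-8 encoding as integers."""
    return list(text.encode("utf-8"))


# One grapheme cluster built from two code points:
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
accented = "e\u0301"

# One code point outside the BMP (U+1D000, in the Byzantine Musical
# Symbols block), which UTF-16 must represent as a surrogate pair.
byzantine = "\U0001D000"

for label, text in (("accented", accented), ("byzantine", byzantine)):
    print(label)
    print("  code points :", [hex(n) for n in code_points(text)])
    print("  UTF-16 units:", [hex(n) for n in utf16_code_units(text)])
    print("  UTF-8 bytes :", [hex(n) for n in utf8_bytes(text)])
```

The accented letter is two units at the code point level and two at the
UTF-16 level; the musical symbol is one code point but two UTF-16 code
units. Which "character" gets broken depends on which layer the
off-by-one error happens at.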

But if we did sort that out, your main point is along the lines of
saying that similar errors in low-level string manipulation (off-by-one
errors and the like) create both kinds of bug and, either way, you get
garbage. Cowan's point is that the two bugs, even if the same coding
errors produce them, have different impacts on basic Unicode
algorithms. For example, you can translate a garbled grapheme cluster
to UTF-8 just fine but, strictly speaking, not an isolated surrogate --
so presumably systems will tend to degrade more gracefully if the only
bugs they have are of the grapheme-cluster kind.
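
A minimal sketch of that difference, again in Python and again only as
an illustration: Python strings are code point sequences that happen to
permit lone surrogates, and its strict UTF-8 encoder rejects them. The
function name is hypothetical.

```python
def encodes_to_utf8(text):
    """Return True if text can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# A grapheme cluster split in the wrong place: slicing off the base
# letter leaves an orphaned U+0301 COMBINING ACUTE ACCENT.
orphaned_mark = "e\u0301"[1:]

# A surrogate pair split in the wrong place: a high surrogate with no
# low surrogate after it.
lone_surrogate = "\ud834"

print(encodes_to_utf8(orphaned_mark))   # True: garbled, but still text
print(encodes_to_utf8(lone_surrogate))  # False: not a valid scalar value
```

The orphaned mark is still a well-formed sequence of code points, so
every later stage can process it. The lone surrogate is not, so the
first strict encoder it reaches has to reject it or substitute
something.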


>
>> Separating surrogate pairs is (a) UTF-16 specific and (b) leaves the
>> result uninterpretable. Gumming up a grapheme cluster is more like
>> an off-by-one error in inserting a character: the output is garbled
>> but not garbage.
>

What he said.

-t


> Most systems recover from the former error by losing the one broken
> character (some systems replace it with '?'; some render a blank box)
> and interpreting everything else just fine. I don't know what you
> mean by "uninterpretable".
>
> Most systems recover from the latter error by silently discarding the
> orphaned combining marks.
>
> (shrug) I don't see how the first one is more annoying than slow
> software, while the second one is negligible--especially given that
> surrogate pairs are extremely rare in practice (few people's names
> contain Byzantine musical symbols or Kharoshthi letters) compared
> to, you know, accents.
>
> -j
Received on Fri Mar 16 2007 - 02:35:12 UTC
