[r6rs-discuss] Strings as codepoint-vectors: bad from John Cowan on 2007-03-16 (r6rs-discuss.mbox)

From: John Cowan <cowan>
Date: Fri Mar 16 02:54:44 2007

Jason Orendorff scripsit:

> I guess we just have to disagree. Both cases involve a character
> being botched because software broke the data at an inappropriate
> boundary. To me, they're not just analogous; they're practically
> identical. I'm trying to imagine how I would explain the distinction
> to my wife. Drawing a blank here.

My wife is not at all technical, but she has a lot of patience with me,
so the conversation would go like this:

* John points to a lower-case a with ogonek and acute.

John: This is a letter used to write Navajo. You'll notice that there's
a hook below the "a" and a small accent mark above it?

* Gale notices.

John: The hook indicates that instead of the letter being pronounced
"ah", it's pronounced like the "an" in the French word _blanc_, meaning
'white'. And the acute accent above the letter means that it's pronounced
with a high-pitched tone, like "an" (high pitch), not "an" (low pitch).

Gale: Say that again, more slowly?

* John repeats what he said, including the "an" sound with high and low
pitch.

Gale: Okay.

John: Now in the computer we represent letters by numbers. Most letters
are represented by just one number. For example, the number for plain
lower-case "a" is 97. But this Navajo letter is represented by two numbers.
The first number, which is 261, represents the a with the hook, and the
second number, which is 769, represents the acute accent.

* Gale nods.

John: Suppose I wanted to put the letter "b" after this letter. If I
did things right, there would then be three numbers in the computer: 261,
769, and 98, which is the number for "b". But if I screwed up, and put
the number 98 in between the other two, then I'd get an a with a hook,
followed by a b with an acute. It wouldn't make any sense.

Gale: Obviously.
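
[Editorial sketch, not part of the original conversation. The Python
below restates the example as code points: 261 is U+0105, 769 is U+0301,
and 98 is U+0062, lower-case "b". The helper name clusters is
hypothetical, and its rule (attach every combining mark to the character
before it) is only a rough approximation of how a renderer groups
characters, not the full Unicode segmentation algorithm.]

```python
import unicodedata

A_OGONEK = 261  # U+0105 LATIN SMALL LETTER A WITH OGONEK
ACUTE = 769     # U+0301 COMBINING ACUTE ACCENT
B = 98          # U+0062 LATIN SMALL LETTER B


def clusters(text):
    """Group each combining mark with the character that precedes it.

    Assumption: a mark has a nonzero canonical combining class. This is a
    simplification, not a full grapheme-cluster implementation.
    """
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch   # the mark lands on whatever came before it
        else:
            groups.append(ch)  # a base character starts a new group
    return groups


# Done right: the accent follows the a-with-hook it belongs to.
right = "".join(chr(n) for n in (A_OGONEK, ACUTE, B))
# Screwed up: "b" intrudes, so the accent now follows the "b".
wrong = "".join(chr(n) for n in (A_OGONEK, B, ACUTE))

for label, text in (("right", right), ("wrong", wrong)):
    print(label, [[f"U+{ord(c):04X}" for c in g] for g in clusters(text)])
```

[In the first string the acute stays grouped with the a-with-hook; in the
second it is grouped with the "b". Both strings are still made of three
valid code points, which is why the mistake stays readable, if wrong.]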

[So far so good. But now consider this part:]

John: Next comes the letter "ahsa". It's the first letter of the Gothic
alphabet, which was used to write the Gothic language back in the fifth
century. As you can see, it looks sort of like an "A", and it has the
"ah" sound in Gothic, at least most of the time.

John: In the computer, we also represent this letter with two numbers.

Gale: Why two numbers? It doesn't have two parts, like the other one.

John: No, it doesn't. They do it that way for stupid historical reasons.

Gale: Well, if you say so.

John: Now if I wanted the next Gothic letter, which is called "berkan"
and is almost exactly like the letter "B", to go after it, then I'd have
four numbers. All the Gothic letters need two numbers each.

Gale: They do?

John: Afraid so. Well, there is a way to represent "ahsa" and "berkan"
with one number each. But the numbers are too big to fit in a regular
computer's memory, so the computer likes it better if we break them into
two parts.

Gale: Whatever.

John: So if I put the two numbers for "berkan" after the first number
for "ahsa", then the computer would show a blank box, the "berkan",
and another blank box.

Gale: It would? How stupid.
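
[Editorial sketch, not part of the original conversation. The "two
numbers" for each Gothic letter are a UTF-16 surrogate pair. The Python
below computes them, assuming "ahsa" is U+10330 and "berkan" is U+10331;
the function name to_surrogates is hypothetical.]

```python
AHSA = 0x10330    # U+10330 GOTHIC LETTER AHSA
BERKAN = 0x10331  # U+10331, the letter called "berkan" in the dialogue


def to_surrogates(code_point):
    """Split a code point above U+FFFF into its two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]


ahsa_units = to_surrogates(AHSA)
berkan_units = to_surrogates(BERKAN)

# Done right: four numbers, two per letter.
correct = ahsa_units + berkan_units
# Screwed up: both numbers for "berkan" land after the first number of "ahsa".
broken = ahsa_units[:1] + berkan_units + ahsa_units[1:]

print([hex(u) for u in correct])
print([hex(u) for u in broken])
```

[Neither half of a split pair means anything alone: the first number of
"ahsa" no longer has its partner next to it, and the second is stranded
after "berkan".]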

> Most systems recover from the former error by losing the one broken
> character (some systems replace it with '?'; some render a blank box)

Two blank boxes, if you have separated the parts of a surrogate pair.
And you have no clue what those boxes were supposed to mean.
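
[Editorial sketch. The two boxes can be reproduced with a small UTF-16
decoder written in Python. The function decode_units is hypothetical, and
its policy of turning each unpaired surrogate into U+FFFD REPLACEMENT
CHARACTER is an assumption standing in for "blank box"; real systems
differ in how they report the damage.]

```python
REPLACEMENT = "\ufffd"  # stands in for the blank box


def decode_units(units):
    """Decode 16-bit code units, replacing each unpaired surrogate."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A well-formed pair: recombine it into one code point.
            offset = ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(0x10000 + offset))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Half of a pair with no partner: the original letter is lost.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out)


# First half of "ahsa", both halves of "berkan", second half of "ahsa".
broken = [0xD800, 0xD800, 0xDF31, 0xDF30]
decoded = decode_units(broken)
print([f"U+{ord(c):04X}" for c in decoded])
```

[Under these assumptions the result is a replacement character, the
intact U+10331, and a second replacement character: two boxes for one
damaged letter, and nothing in either box says which letter it was.]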

> Most systems recover from the latter error by silently discarding the
> orphaned combining marks.

Actually they don't. Usually the combining mark is placed over the
intruded character, as in the dialogue above. (Try it and see.)
If that's impossible, as when the intruded character is a newline,
then the combining mark is placed on a dotted-circle glyph. This
may also be done if the renderer does not know how to handle the
combination, as when you intrude Devanagari "ka" between a Hebrew
consonant letter and its vowel point.

-- 
MEET US AT POINT ORANGE AT MIDNIGHT BRING YOUR DUCK OR PREPARE TO FACE WUGGUMS
John Cowan      cowan_at_ccil.org      http://www.ccil.org/~cowan

Received on Fri Mar 16 2007 - 02:54:40 UTC
