[r6rs-discuss] Unicode version

From: John Cowan <cowan>
Date: Fri Feb 16 17:21:21 2007

Bob Burger scripsit:

> To see how much changed, I downloaded CaseFolding.txt for versions 4.1.0
> and 5.0.0 from http://www.unicode.org/Public/{4.1.0,5.0.0}/ucd/. In
> one year we have the changes given below in diff format. There are
> over 1400 changes to UnicodeData.txt!

A great many of the changes to UnicodeData are due to the introduction
of new characters, however.

> This rate of change makes me wonder how I'm supposed to write robust
> software that uses these locale-independent functions without getting
> burned by updates and upgrades in the future. Thoughts?

You happened to hit a particularly bad patch for case pairs, because as
part of 5.0 an effort was made to make sure that every Lu character has a
Ll counterpart (though not vice versa, which is why Unicode case folding
is always toward lowercase). This accounts for all the differences you
found, with the following exceptions:

> < 0241; C; 0294; # LATIN CAPITAL LETTER GLOTTAL STOP

This was cleaning up a mess. Originally there was only GLOTTAL STOP,
used basically in IPA, and as such lowercase. However, it is as tall
as typical capital letters, and when adopted into certain practical
orthographies it was used as the uppercase form and a new lowercase form
was created.

In 4.1, CAPITAL GLOTTAL STOP (U+0241) was added, which formed a casing pair
with the existing GLOTTAL STOP (U+0294). This was unsatisfactory, however, as
the *uppercase* form is identical in appearance to the caseless form
(which shared a codepoint with the different-looking lowercase form).
In 5.0, therefore, SMALL GLOTTAL STOP (U+0242) was added to form an ordinary
casing pair with U+0241, and plain GLOTTAL STOP was changed to Lo to indicate
its caseless status.
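
For anyone who wants to check the present state of these three characters,
here is a minimal Python sketch using the standard unicodedata module. It
reports whatever Unicode version the running Python ships with (later than
5.0 in any Python 3), not the 4.1 data discussed above. The name
glottal_stop_report is made up for illustration.

```python
import unicodedata

# Code points of the three glottal stop letters discussed above.
GLOTTAL_STOPS = {
    "caseless": "\u0294",  # LATIN LETTER GLOTTAL STOP
    "capital": "\u0241",   # LATIN CAPITAL LETTER GLOTTAL STOP
    "small": "\u0242",     # LATIN SMALL LETTER GLOTTAL STOP
}


def glottal_stop_report():
    """Return {label: (general category, case-folded code point in hex)}."""
    report = {}
    for label, ch in GLOTTAL_STOPS.items():
        folded = ch.casefold()  # Python's built-in case folding
        report[label] = (unicodedata.category(ch), f"{ord(folded):04X}")
    return report


if __name__ == "__main__":
    for label, (category, folded) in glottal_stop_report().items():
        print(f"{label:9} category={category} folds to U+{folded}")
```

With post-5.0 data the caseless letter should come out as Lo and fold to
itself, while the capital folds to the small letter.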

> >0246; C; 0247; # LATIN CAPITAL LETTER E WITH STROKE
> >0248; C; 0249; # LATIN CAPITAL LETTER J WITH STROKE
> >024A; C; 024B; # LATIN CAPITAL LETTER SMALL Q WITH HOOK TAIL
> >024C; C; 024D; # LATIN CAPITAL LETTER R WITH STROKE
> >024E; C; 024F; # LATIN CAPITAL LETTER Y WITH STROKE
> >04FA; C; 04FB; # CYRILLIC CAPITAL LETTER GHE WITH STROKE AND HOOK
> >04FC; C; 04FD; # CYRILLIC CAPITAL LETTER HA WITH HOOK
> >04FE; C; 04FF; # CYRILLIC CAPITAL LETTER HA WITH STROKE
> >0510; C; 0511; # CYRILLIC CAPITAL LETTER REVERSED ZE
> >0512; C; 0513; # CYRILLIC CAPITAL LETTER EL WITH HOOK
> >2183; C; 2184; # ROMAN NUMERAL REVERSED ONE HUNDRED
> >2C60; C; 2C61; # LATIN CAPITAL LETTER L WITH DOUBLE BAR
> >2C69; C; 2C6A; # LATIN CAPITAL LETTER K WITH DESCENDER
> >2C6B; C; 2C6C; # LATIN CAPITAL LETTER Z WITH DESCENDER
> >2C75; C; 2C76; # LATIN CAPITAL LETTER HALF H

All of these represent novel letters introduced in 5.0 in both lowercase
and uppercase forms, necessitating new CaseFolding entries. You can't
avoid these, but they don't represent backward-incompatible changes,
since neither member of each pair existed before.
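
The distinction between a new entry and a changed one can be checked
mechanically. The following is a rough Python sketch, not anything from the
Unicode tools: it parses lines in the CaseFolding.txt layout
(code; status; mapping; # name) and sorts the differences between two
versions into added, changed, and removed. The function names and the two
tiny sample strings are illustrative only; the 5.0 line for 0241 is my
assumption about the other half of the diff, which is not quoted above.

```python
def parse_case_folding(text):
    """Map (code, status) -> mapping for each data line in CaseFolding.txt format."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        code, status, mapping = fields[0], fields[1], fields[2]
        table[(code, status)] = mapping
    return table


def classify_changes(old, new):
    """Return (added, changed, removed) keys between two parsed tables."""
    added = sorted(key for key in new if key not in old)
    removed = sorted(key for key in old if key not in new)
    changed = sorted(key for key in new if key in old and old[key] != new[key])
    return added, changed, removed


# Tiny excerpts standing in for the real 4.1.0 and 5.0.0 files.
OLD_SAMPLE = "0241; C; 0294; # LATIN CAPITAL LETTER GLOTTAL STOP\n"
NEW_SAMPLE = (
    "0241; C; 0242; # LATIN CAPITAL LETTER GLOTTAL STOP\n"
    "0246; C; 0247; # LATIN CAPITAL LETTER E WITH STROKE\n"
)

if __name__ == "__main__":
    old = parse_case_folding(OLD_SAMPLE)
    new = parse_case_folding(NEW_SAMPLE)
    added, changed, removed = classify_changes(old, new)
    print("added:", added)
    print("changed:", changed)
    print("removed:", removed)
```

Only the "changed" and "removed" lists point at entries that could alter the
folding of text that already existed; "added" entries concern code points
that had no folding before.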

Anyhow, the good news is that case folding is, as of 5.0, covered
by a formal Unicode stability policy. If a text is case-folded
in any version >= 5.0, it will still be case-folded in any later
version. See http://www.unicode.org/standard/stability_policy.html#Case
for the formal statement of this policy.
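
In practical terms, robust software can store text in folded form and rely
on it staying folded. A minimal Python sketch of the "is this text already
case-folded?" test follows. Python's str.casefold stands in here for whatever
folding function your implementation provides, and is_case_folded is a
made-up name. Note that it can only test against the Unicode version built
into the running Python; the cross-version guarantee comes from the policy,
not from the code.

```python
def is_case_folded(text):
    """True if case folding leaves the text unchanged."""
    return text.casefold() == text


if __name__ == "__main__":
    for sample in ["strasse", "Stra\u00dfe", "\u0241", "\u0242"]:
        # Fold once, then confirm the result passes the folded test.
        folded = sample.casefold()
        print(repr(sample), is_case_folded(sample), is_case_folded(folded))
```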

-- 
With techies, I've generally found              John Cowan
If your arguments lose the first round          http://www.ccil.org/~cowan
    Make it rhyme, make it scan                 cowan_at_ccil.org
    Then you generally can
Make the same stupid point seem profound!           --Jonathan Robie
Received on Fri Feb 16 2007 - 17:21:14 UTC
