[r6rs-discuss] questions for Unicode gurus from William D Clinger on 2007-04-01 (r6rs-discuss.mbox)

From: William D Clinger <will>
Date: Sun Apr 1 10:47:33 2007

I am posting this as an individual member of the Scheme
community. I am not speaking for the R6RS editors.

Several issues came up while upgrading the reference
implementation of (r6rs unicode) to Unicode revision 5.0.0.

In the current draft R6RS, char-alphabetic? is true iff
the character is in categories Lu, Ll, Lt, Lm, or Lo.
In Unicode, however, a character has the Alphabetic
property iff it is in categories Lu, Ll, Lt, Lm, Lo, or
Nl, or is one of 511 Other_Alphabetic characters [1,2].
Shouldn't char-alphabetic? be changed to coincide with
Unicode's Alphabetic property?
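
The gap between the two definitions can be sketched in Python, whose
standard unicodedata module exposes general categories. This is only an
illustration: both function names are hypothetical, the Other_Alphabetic
characters are left out because unicodedata does not expose that property,
and the results reflect whichever Unicode version the Python build bundles
rather than 5.0.0.

```python
import unicodedata

# The five letter categories named in the draft R6RS definition.
DRAFT_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def draft_char_alphabetic(ch):
    """Draft R6RS rule: general category is Lu, Ll, Lt, Lm, or Lo."""
    return unicodedata.category(ch) in DRAFT_CATEGORIES

def category_part_of_alphabetic(ch):
    """Category part of Unicode's Alphabetic property: the draft set plus Nl.

    Other_Alphabetic is deliberately omitted (not available in unicodedata).
    """
    return unicodedata.category(ch) in DRAFT_CATEGORIES | {"Nl"}

roman_one = "\u2160"  # ROMAN NUMERAL ONE, general category Nl
print(draft_char_alphabetic(roman_one))
print(category_part_of_alphabetic(roman_one))
```

The two predicates disagree on every Nl character, such as the Roman
numeral used above.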

That would make char-alphabetic? consistent with the
specifications of char-upper-case? and char-lower-case?.
With the current draft R6RS and Unicode 5.0.0, there are
43 lower-case and 42 upper-case characters for which
char-lower-case? or char-upper-case? returns true but
char-alphabetic? returns false.
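
A scan of this kind can be sketched in Python. The sketch assumes that
CPython's str.isupper and str.islower follow Unicode's Uppercase and
Lowercase properties and so can stand in for char-upper-case? and
char-lower-case?; the function name is hypothetical. Because Python bundles
a later Unicode version than 5.0.0, the counts it prints need not be 42 and
43.

```python
import sys
import unicodedata

# Categories accepted by the draft R6RS char-alphabetic?.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def cased_but_not_draft_alphabetic():
    """Return code points that are upper/lower case but not draft-alphabetic."""
    upper, lower = [], []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in LETTER_CATEGORIES:
            continue  # draft char-alphabetic? would return true here
        if ch.isupper():    # assumed stand-in for char-upper-case?
            upper.append(cp)
        elif ch.islower():  # assumed stand-in for char-lower-case?
            lower.append(cp)
    return upper, lower

upper, lower = cased_but_not_draft_alphabetic()
print(len(upper), len(lower))
```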

In the current draft R6RS, NEL (U+0085) is not considered
whitespace by char-whitespace?, despite the editors'
response to formal comment #22. I assume this was an
oversight and should be corrected.

Unicode defines the White_Space property by an explicit
enumeration [2]. Wouldn't it be better for the R6RS to
define char-whitespace? by reference to the Unicode
White_Space property, so it will stay in sync with Unicode
if more characters are added with that property?
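
Defining the predicate by reference could amount to generating it from
PropList.txt, as in the Python sketch below. The embedded excerpt is
abbreviated and hand-copied for illustration, and parse_property and
char_whitespace are hypothetical names; a real implementation would read
the complete file for the Unicode version it targets.

```python
# Abbreviated, hand-copied excerpt in PropList.txt format (NOT the full list).
PROPLIST_EXCERPT = """\
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
"""

def parse_property(text, name):
    """Collect inclusive (first, last) code point ranges for one property."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        codes, prop = (field.strip() for field in data.split(";"))
        if prop != name:
            continue
        first, _, last = codes.partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges

WHITE_SPACE = parse_property(PROPLIST_EXCERPT, "White_Space")

def char_whitespace(ch):
    """True if ch falls in any range parsed from the excerpt."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in WHITE_SPACE)

print(char_whitespace("\x85"))  # NEL is covered by the 0085 entry
```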

In the current draft R6RS, string-titlecase is defined
in terms of contiguous sequences of cased characters,
ignoring case-ignorable characters. This fairly simple
specification (which I believe to have been taken from
Unicode 3) is inconsistent with Unicode 4 and 5, which
define the conversion to title case in terms of the word
boundaries specified by Unicode Standard Annex #29 (Text
Boundaries) [3]. Shouldn't string-titlecase be consistent
with the current Unicode specification?
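
For comparison, one possible reading of the simple rule can be sketched in
Python. Everything here is an approximation under stated assumptions: the
helper names are hypothetical, "cased" is approximated with Python's own
case predicates, and "case-ignorable" uses general categories Mn, Me, Cf,
Lm, and Sk plus a two-character hand-picked sample standing in for the
MidLetter characters. It does not implement the word boundaries of Unicode
Standard Annex #29.

```python
import unicodedata

# Partial, hand-picked stand-in for Word_Break=MidLetter (an assumption).
MID_LETTER_SAMPLE = {"'", "\u2019"}

def is_cased(ch):
    """Approximation of 'cased' using Python's case predicates."""
    return ch.isupper() or ch.islower() or unicodedata.category(ch) == "Lt"

def is_case_ignorable(ch):
    """Approximation of 'case-ignorable'; see the assumptions above."""
    return (ch in MID_LETTER_SAMPLE
            or unicodedata.category(ch) in {"Mn", "Me", "Cf", "Lm", "Sk"})

def simple_titlecase(s):
    """Titlecase the first character of each cased run, lowercase the rest."""
    out = []
    in_run = False
    for ch in s:
        if is_case_ignorable(ch):
            out.append(ch)          # passed through; does not end the run
        elif is_cased(ch):
            out.append(ch.lower() if in_run else ch.title())
            in_run = True
        else:
            out.append(ch)
            in_run = False          # any other character ends the run
    return "".join(out)

print(simple_titlecase("can't stop"))   # Can't Stop
```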

Speaking of Unicode Standard Annex #29 (Text Boundaries),
its rules for determining word boundaries are defined in
terms of a property named "ALetter", whose definition is

    Alphabetic=true, or
    U+05F3 HEBREW PUNCTUATION GERESH
    and Ideographic=false
    and Word_Break != Katakana
    and LineBreak != Complex_Context(SA)
    and Script != Hiragana
    and Grapheme_Extend=false

After tedious examination of WordBreakProperty.txt [4],
I have concluded that, contrary to the usual precedence
rules for "or" and "and", the above is to be read as

    (Alphabetic=true or U+05F3 HEBREW PUNCTUATION GERESH)
    and Ideographic=false
    and Word_Break != Katakana
    and LineBreak != Complex_Context(SA)
    and Script != Hiragana
    and Grapheme_Extend=false

Can some Unicode guru please confirm this reading? And
if my reading is correct, can some Unicode guru please
ask the authors of Unicode to make this reading a little
more obvious? And if my reading is incorrect, what is
the correct reading?
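
That the two readings really differ can be checked with a small Python
sketch. The Props record and its field names are hypothetical, and the
sample values are entered by hand for illustration rather than read from
the Unicode data files.

```python
from dataclasses import dataclass

@dataclass
class Props:
    """Hand-entered property values for one character (illustrative only)."""
    alphabetic: bool
    is_geresh: bool            # character is U+05F3
    ideographic: bool
    word_break_katakana: bool
    line_break_sa: bool
    script_hiragana: bool
    grapheme_extend: bool

def restrictions(p):
    """The chain of 'and' conditions from the definition."""
    return (not p.ideographic and not p.word_break_katakana
            and not p.line_break_sa and not p.script_hiragana
            and not p.grapheme_extend)

def usual_precedence(p):
    # "and" binds tighter than "or": Alphabetic=true alone suffices.
    return p.alphabetic or (p.is_geresh and restrictions(p))

def grouped_reading(p):
    # The "or" is grouped first, so the restrictions apply to every character.
    return (p.alphabetic or p.is_geresh) and restrictions(p)

# An alphabetic character whose script is Hiragana (values assumed).
hiragana_letter = Props(True, False, False, False, False, True, False)
print(usual_precedence(hiragana_letter), grouped_reading(hiragana_letter))
```

Under the usual precedence such a character would count as ALetter; under
the grouped reading it would not.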

By the way, does anyone know of a good algorithm for
computing the ALetter property? You may assume I am
capable of constructing the obvious algorithms from the
above spec or from WordBreakProperty.txt, and am hoping
for something better.
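
For reference, the obvious table-driven approach looks roughly like the
Python sketch below: a sorted list of inclusive ranges searched by
bisection. The table is an abbreviated, hand-copied sample rather than the
full contents of WordBreakProperty.txt, is_aletter is a hypothetical name,
and this is the baseline rather than the better algorithm being asked for.

```python
from bisect import bisect_right

# Abbreviated sample of ALetter ranges (inclusive), sorted by start.
# The real table derived from WordBreakProperty.txt is far longer.
ALETTER_RANGES = [
    (0x0041, 0x005A),
    (0x0061, 0x007A),
    (0x00AA, 0x00AA),
    (0x00B5, 0x00B5),
    (0x00BA, 0x00BA),
    (0x00C0, 0x00D6),
    (0x00D8, 0x00F6),
]
STARTS = [lo for lo, _ in ALETTER_RANGES]

def is_aletter(cp):
    """Binary search for the last range starting at or before cp."""
    i = bisect_right(STARTS, cp) - 1
    return i >= 0 and cp <= ALETTER_RANGES[i][1]

print(is_aletter(ord("A")), is_aletter(0x00D7))
```

Lookup cost grows with the logarithm of the number of ranges, at the price
of storing the whole table.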

Incidentally, the specification of word boundaries in
Unicode Standard Annex #29 also has implications for the
treatment of Greek sigma by string-downcase.
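
The sigma issue is that capital sigma (U+03A3) lowercases to final sigma
(U+03C2) at the end of a word and to U+03C3 elsewhere, so the result
depends on where words end. As a rough illustration only, Python's built-in
str.lower applies a final-sigma rule; it is assumed here to decide "final"
from neighbouring cased characters rather than from the Annex #29 word
boundaries, and the downcase wrapper is a hypothetical stand-in for
string-downcase.

```python
def downcase(s):
    """Thin hypothetical wrapper standing in for string-downcase."""
    return s.lower()

# GREEK CAPITAL LETTERS OMICRON, DELTA, OMICRON, SIGMA
word = "\u039F\u0394\u039F\u03A3"
lowered = downcase(word)

print(lowered[-1] == "\u03C2")          # word-final sigma
print(downcase("\u03A3") == "\u03C3")   # isolated sigma is not final
```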

Will

[1] http://www.unicode.org/Public/UNIDATA/UCD.html#Properties
[2] http://www.unicode.org/Public/UNIDATA/PropList.txt
[3] http://www.unicode.org/reports/tr29/
[4] http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakProperty.txt
Received on Sun Apr 01 2007 - 10:47:27 UTC
