[R6RS] Unicode titlecase algorithm

Thu May 10 07:20:54 EDT 2007

Mike wrote:
> We decided to adopt Will's suggestion of adopting the official Unicode
> titlecase algorithm.  Unfortunately, I can't find a specification of it
> on the web right now: The Unicode site says PDFs for the standard are
> temporarily offline.  Could somebody help me out?

If you're asking for citations, you should cite:

    the main Unicode standard (currently 5.0)
    Unicode Standard Annex #29
        ( http://www.unicode.org/reports/tr29/ )

If you're asking for a simple specification of the
algorithm, there isn't one.  About the best you can
do is to change the sentence in the 5.92 draft that
says

    The string-titlecase procedure converts the
    first character to title case in each
    contiguous sequence of cased characters
    within string, and it downcases all other
    cased characters; for the purposes of
    detecting cased-character sequences,
    case-ignorable characters are ignored (i.e.
    they do not interrupt the sequence).

to

    The string-titlecase procedure converts the
    first cased character of each word to title
    case, and downcases all other cased characters.

That sounds simple, but the hair lies in the
definition of a word.  The Unicode standard
explicitly defers to UAX 29 on this point, and
the definition of a word in UAX 29 is neither
simple nor categorical.  (In particular, word
breaking is allowed to be locale-sensitive.)
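
To make the revised wording concrete, here is a minimal Python
sketch of "titlecase the first cased character of each word,
downcase the rest".  It is only an illustration: the word
definition it uses (a maximal run of non-whitespace characters)
is a deliberate simplification and is NOT the UAX 29 definition,
and the names is_cased and titlecase_words are hypothetical.

```python
def is_cased(ch):
    # Approximates the Unicode "cased" property for one character:
    # lowercase, uppercase, or titlecase.
    return ch.islower() or ch.isupper() or ch.istitle()


def titlecase_words(s):
    """Titlecase the first cased character of each word and
    downcase every other cased character.

    Assumption: a "word" is a maximal run of non-whitespace
    characters.  UAX 29 word breaking is considerably subtler
    and may be tailored per locale.
    """
    out = []
    seen_cased = False  # has this word already had a cased character?
    for ch in s:
        if ch.isspace():
            seen_cased = False  # whitespace ends the current word
            out.append(ch)
        elif is_cased(ch):
            # For a one-character string, str.title() applies the
            # titlecase mapping and str.lower() the lowercase mapping.
            out.append(ch.lower() if seen_cased else ch.title())
            seen_cased = True
        else:
            out.append(ch)  # uncased characters pass through unchanged
    return "".join(out)


print(titlecase_words("hello wORLD, it's 5pm"))
```

Swapping in a real UAX 29 word-break iterator would change only
the test that resets seen_cased; the casing step itself stays
the same, which is why the difficulty sits in the definition of
a word rather than in the casing.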

The fact that the Unicode committee understands
that categorical specifications are not always
desirable does not bother me, of course; I point
it out only because it might bother some of the
other editors.

I guess I should also point out that UAX 29 is
implicitly part of the specification for both
string-downcase and string-foldcase, since the
casing of Greek sigma is defined with respect to
word breaks, which are defined by UAX 29.
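
The sigma dependence can be sketched in the same spirit.  The
Python below downcases a string and picks final sigma for a
capital sigma that ends a word.  As before, the word boundary
test (the neighbouring character is or is not alphabetic) is a
simplifying assumption standing in for UAX 29, and the function
name is hypothetical.

```python
CAPITAL_SIGMA = "\u03a3"  # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03c3"    # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA


def downcase_with_final_sigma(s):
    """Downcase s, mapping capital sigma to final sigma when it
    ends a word and to ordinary small sigma otherwise.

    Assumption: a word is a run of alphabetic characters, so a
    sigma is word-final when the previous character is alphabetic
    and the next one (if any) is not.
    """
    out = []
    for i, ch in enumerate(s):
        if ch == CAPITAL_SIGMA:
            letter_before = i > 0 and s[i - 1].isalpha()
            letter_after = i + 1 < len(s) and s[i + 1].isalpha()
            if letter_before and not letter_after:
                out.append(FINAL_SIGMA)
            else:
                out.append(SMALL_SIGMA)
        else:
            out.append(ch.lower())  # context-free lowercase mapping
    return "".join(out)


for word in ("\u03a7\u0391\u039f\u03a3",          # sigma ends the word
             "\u03a7\u0391\u039f\u03a3\u03a3",    # only the last sigma is final
             "\u03a7\u0391\u039f\u03a3 \u03a3"):  # a lone sigma is not final
    print(downcase_with_final_sigma(word))
```

The point of the sketch is that the result for a sigma depends
on its neighbours, so any specification of downcasing inherits
whatever notion of word boundary is chosen.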

Will