[R6RS] Unicode normalization

Thu Mar 2 13:04:05 EST 2006

After experimenting with Unicode normalization and re-reading the SRFI
discussion, I propose the following simple change to the SRFI:

 * Add `string-normalize-nfd',
       `string-normalize-nfkd',
       `string-normalize-nfc', and
       `string-normalize-nfkc'.
   Each of these procedures takes a string and returns its D, KD, C, or
   KC normalization, respectively.
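
To make the four forms concrete, here is a rough sketch in Python rather
than Scheme, using the standard `unicodedata' module as a stand-in for
the proposed procedures. The helper names `forms' and `code_points' are
made up for this illustration and are not part of the proposal.

```python
import unicodedata

def forms(s):
    """Return the four Unicode normalizations of s, keyed by form name."""
    return {f: unicodedata.normalize(f, s) for f in ("NFD", "NFKD", "NFC", "NFKC")}

def code_points(s):
    """Render a string as a list of U+XXXX code points for inspection."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# U+00EA (e with circumflex) decomposes canonically to e + U+0302.
for name, value in forms("\u00ea").items():
    print(name, code_points(value))
# NFD  U+0065 U+0302
# NFKD U+0065 U+0302
# NFC  U+00EA
# NFKC U+00EA

# U+FB01 (the "fi" ligature) has only a compatibility decomposition,
# so the K forms change it while the canonical forms leave it alone.
for name, value in forms("\ufb01").items():
    print(name, code_points(value))
# NFD  U+FB01
# NFKD U+0066 U+0069
# NFC  U+FB01
# NFKC U+0066 U+0069
```

The second example is where the D/C and KD/KC pairs diverge, which is
one reason a program might want more than one of the four.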

Originally, I imagined just picking one. But the one I would have
picked is NFC, and by the time you have NFC, it's a small step to have
all four. Also, all of them are useful to programs that deal with
Unicode somewhat explicitly.

I recommend against using any normalization for symbols or input
streams. Here's my rationale:

   * Normalizing symbols without normalizing strings will lead to
     confusion. With NFD or NFKD normalization, 'ê (that's a quote
     followed by U+00EA) would not be the same as
     (string->symbol "ê"). If NFC or NFKC is used, adjust the example
     by decomposing ê into two characters in the string.

   * For consistency overall, normalization needs to be pushed down to
     the lexical level. Otherwise, #\ê might turn out to be the ê
     character, or it might be a syntax error (i.e., a #\e followed by
     a non-delimiting character). In other words, we'd have to either
     require a program to be represented as a normalized stream of
     characters, or specify normalization as part of the parsing
     process.

   * Normalizing things like strings probably interferes with
     representing literal data in strings. For example, I would guess
     that pathnames like "ê" typically use NFC-like encodings, but I'm
     not sure.
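
The first two points can be sketched as follows, again in Python with
`unicodedata' standing in for a Scheme reader. Both `read_symbol' and
`string_to_symbol' are hypothetical models of a reader that
NFD-normalizes symbol names and a `string->symbol' that does not; no
actual implementation is being described.

```python
import unicodedata

def read_symbol(text):
    """Hypothetical reader: NFD-normalizes the name of a symbol it reads."""
    return unicodedata.normalize("NFD", text)

def string_to_symbol(s):
    """Hypothetical string->symbol: uses the string's contents as-is."""
    return s

# The symbol read from source vs. the one built from an unnormalized string.
quoted = read_symbol("\u00ea")          # e + U+0302 after NFD
converted = string_to_symbol("\u00ea")  # still the single code point U+00EA
mismatch = quoted != converted
print(mismatch)  # True: the two "same" symbols have different names

# Normalizing the raw program text changes a character literal, too:
# the three characters # \ ê become the four characters # \ e U+0302.
literal = "#\\\u00ea"
literal_nfd = unicodedata.normalize("NFD", literal)
print(len(literal), len(literal_nfd))  # 3 4
```

After normalization the literal reads as #\e followed by a combining
mark, which is the lexical-level problem described above.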

By not specifying normalization, however, we push the problem into
programming environments and editors. If a programmer types an "ê", for
example, the specific meaning will depend on how the editor saves the
program text. So, I'm not sure it's the right approach.

Matthew