[R6RS] Unicode normalization
Matthew Flatt
mflatt at cs.utah.edu
Thu Mar 2 13:04:05 EST 2006
After experimenting with Unicode normalization and re-reading the SRFI
discussion, I propose the following simple change to the SRFI:
* Add `string-normalize-nfd',
`string-normalize-nfkd',
`string-normalize-nfc', and
`string-normalize-nfkc'.
Each of these procedures takes a string and returns its NFD, NFKD,
NFC, or NFKC normalization, respectively.
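To make the four forms concrete, here is a minimal sketch in Python, using the standard `unicodedata.normalize` as a stand-in for the proposed procedures. The function names below are illustrative Python spellings of the proposed names, not part of any Scheme implementation, and the sample string is my own choice.

```python
import unicodedata

# Hypothetical Python stand-ins for the four proposed Scheme procedures.
def string_normalize_nfd(s):
    return unicodedata.normalize("NFD", s)

def string_normalize_nfkd(s):
    return unicodedata.normalize("NFKD", s)

def string_normalize_nfc(s):
    return unicodedata.normalize("NFC", s)

def string_normalize_nfkc(s):
    return unicodedata.normalize("NFKC", s)

def code_points(s):
    """Render a string as a list of U+XXXX code points for inspection."""
    return [f"U+{ord(c):04X}" for c in s]

# Sample input: ê (U+00EA) followed by the "fi" ligature (U+FB01).
sample = "\u00ea\ufb01"

for proc in (string_normalize_nfd, string_normalize_nfkd,
             string_normalize_nfc, string_normalize_nfkc):
    print(proc.__name__, code_points(proc(sample)))
```

The canonical forms (NFD, NFC) only decompose or recompose ê; the compatibility forms (NFKD, NFKC) additionally replace the ligature with plain "f" and "i".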
Originally, I imagined just picking one. But the one I would have
picked is NFC, and by the time you have NFC, it's a small step to have
all four. Also, all of them are useful to programs that deal with
Unicode somewhat explicitly.
I recommend against using any normalization for symbols or input
streams. Here's my rationale:
* Normalizing symbols without normalizing strings will lead to
confusion, since 'ê (that's a quote followed by U+00EA) would not
be the same as (string->symbol "ê") with NFD or NFKD
normalization. If NFC or NFKC is used, adjust the example by
decomposing ê to two characters in the string.
* For consistency overall, normalization needs to be pushed down to
the lexical level. Otherwise, #\ê might turn out to be the ê
character, or it might be a syntax error (i.e., a #\e followed by
a non-delimiting character). In other words, we'd have to either
require a program to be represented as a normalized stream of
characters, or specify normalization as part of the parsing
process.
* Normalizing things like strings probably interferes with
representing literal data in strings. For example, I would guess
that pathnames like "ê" typically use NFC-like encodings, but I'm
not sure.
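The first two points can be sketched in Python under some loud assumptions: `read_symbol` models a hypothetical reader that normalizes symbol names to NFD, and `string_to_symbol` models a `string->symbol' that leaves its argument untouched. Neither reflects an actual Scheme implementation; they only illustrate the mismatch.

```python
import unicodedata

composed = "\u00ea"      # ê as a single code point, U+00EA
decomposed = "e\u0302"   # e followed by U+0302 COMBINING CIRCUMFLEX ACCENT

def read_symbol(text, form="NFD"):
    """Hypothetical reader: normalizes a symbol's name as it is read."""
    return unicodedata.normalize(form, text)

def string_to_symbol(s):
    """Hypothetical string->symbol: uses the string's code points as-is."""
    return s

# 'ê as seen by a normalizing reader vs. (string->symbol "ê") on a raw string.
quoted = read_symbol(composed)
from_string = string_to_symbol(composed)
print(quoted == from_string)   # the two "same" symbols do not match

# With NFC instead, the mismatch appears when the string is decomposed.
print(read_symbol(composed, "NFC") == string_to_symbol(decomposed))

# The character-literal case: under NFD, ê is two code points, so #\ê would
# reach the lexer as #\e followed by a combining character.
print(len(unicodedata.normalize("NFD", composed)))
```

Both comparisons come out false, and the NFD form of ê has length 2, which is why a lexer that sees decomposed text cannot treat #\ê as a single-character literal without normalizing first.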
By not specifying normalization, however, we push the problem into
programming environments and editors. If a programmer types an "ê", for
example, the specific meaning will depend on how the editor saves the
program text. So, I'm not sure it's the right approach.
Matthew