[R6RS] draft Unicode SRFI
Marc Feeley
feeley
Wed Jul 6 09:53:16 EDT 2005
I have started to implement the Unicode SRFI in Gambit and have
encountered some problems. The Unicode collation algorithm, which is
used by string-locale=?, etc., is very complex; see
http://www.unicode.org/unicode/reports/tr10/
The algorithm uses non-trivial collation tables for each locale.
There is no point in reimplementing all of this from scratch, so it
makes sense to base the implementation on the wide character C
library function "wcscoll". Unfortunately this function is not
supported equally well by all C libraries. For example, on Mac OS X
the man page says:
    BUGS
         The current implementation of wcscoll() only works in single-byte
         LC_CTYPE locales, and falls back to using wcscmp() in locales with
         extended character sets.
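
For concreteness, here is a minimal sketch of what a wcscoll-based
comparison could look like. The names string_locale_equal and
string_locale_less are hypothetical stand-ins for string-locale=? and
string-locale<?, not part of any existing API. The sketch assumes the
caller has already selected a collation locale, for example with
setlocale(LC_COLLATE, ""); a C program otherwise starts in the "C"
locale. The quality of the ordering depends entirely on the C library.

```c
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical analogue of string-locale=?: two strings are "equal"
 * when the current LC_COLLATE locale ranks them identically. */
bool string_locale_equal(const wchar_t *a, const wchar_t *b)
{
    return wcscoll(a, b) == 0;
}

/* Hypothetical analogue of string-locale<?: true when a sorts
 * strictly before b in the current LC_COLLATE locale. */
bool string_locale_less(const wchar_t *a, const wchar_t *b)
{
    return wcscoll(a, b) < 0;
}
```

On a C library with the limitation quoted above, these helpers would
silently fall back to a plain code point comparison in multibyte
locales.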
Moreover, even assuming wcscoll had widespread support, it is not clear
to me how to implement the case-independent variants
string-locale-ci=?, ... because wcscoll does not support
case-independent comparison. The best I can come up with is to downcase
both strings in a locale-dependent way, and then use wcscoll. But how
do you downcase in a locale-dependent way? I thought the C library
function "iconv" could be used for this, but it seems to be useful only
for converting between different character encodings.
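
One candidate for the downcasing step, offered only as a sketch, is
the standard towlower function from <wctype.h>, which maps a single
wide character according to the current LC_CTYPE locale. The helper
names downcase_copy and string_locale_ci_equal below are hypothetical.
Note the limitation: towlower is strictly one character in, one
character out, so it cannot express case mappings that change the
length of a string or depend on context.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns a malloc'ed copy of s with each
 * character lowercased per the current LC_CTYPE locale, or NULL if
 * allocation fails. The caller frees the result. */
wchar_t *downcase_copy(const wchar_t *s)
{
    size_t n = wcslen(s);
    wchar_t *out = malloc((n + 1) * sizeof *out);
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        out[i] = (wchar_t)towlower((wint_t)s[i]);
    out[n] = L'\0';
    return out;
}

/* Hypothetical analogue of string-locale-ci=?: downcase both strings,
 * then collate. Returns 1 if equal, 0 if not, -1 on allocation
 * failure. */
int string_locale_ci_equal(const wchar_t *a, const wchar_t *b)
{
    wchar_t *la = downcase_copy(a);
    wchar_t *lb = downcase_copy(b);
    int result = -1;
    if (la != NULL && lb != NULL)
        result = (wcscoll(la, lb) == 0);
    free(la);   /* free(NULL) is a no-op */
    free(lb);
    return result;
}
```

This still inherits every wcscoll limitation mentioned earlier, and
allocating two temporary copies per comparison is unlikely to be
acceptable in a real runtime system.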
I'm beginning to wonder if it is a good idea to put the
locale-specific string procedures in the language. The runtime system
will be larger (in binary code size and in various tables) and we
don't seem to be able to pin down a definition of what a "locale" is
or how the locale is specified to the runtime system. I think that,
given the support in R6RS for Unicode strings, all the
locale-dependent string operations can be written portably and placed
in a "locale" library. Wouldn't this make more sense?
Marc