[r6rs-discuss] Comparing Strings

From: John Cowan <cowan>
Date: Wed Feb 14 14:08:41 2007

MichaelL_at_frogware.com scripsit:

> The C runtime library has two string comparison functions, strcmp and
> strcoll. strcmp is not locale-aware while strcoll is. Some implementations
> add case-insensitive variants of those functions, stricmp and stricoll.

That's because C uses a pre-Unicode model of l10n where your character
encoding is part of your locale, as if it were truly a cultural issue
whether you use ASCII or Latin-1. On a more modern view, character
encoding is an attribute of the external representation of particular
data.
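
A minimal sketch of that view, using Python purely as an illustration
language (it is not part of the original discussion). The same text can
arrive as different bytes depending on the external encoding, but once
decoded it is the same string, and no locale setting is consulted. The
variable names are made up for the example.

```python
# Two external representations of the same text, "strasse" with a sharp s.
latin1_bytes = b"stra\xdfe"       # encoded as Latin-1
utf8_bytes = b"stra\xc3\x9fe"     # encoded as UTF-8

# The encoding matters only at the decoding boundary.
from_latin1 = latin1_bytes.decode("latin-1")
from_utf8 = utf8_bytes.decode("utf-8")

same_text = from_latin1 == from_utf8      # the decoded strings are equal
same_bytes = latin1_bytes == utf8_bytes   # the byte sequences are not
```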

> I believe the main issue is that strings use a different case folding
> algorithm than characters do.

That is because character case conversion is Broken As Designed; it
exists only so that systems with characters as a primitive datatype can
do something analogous to C's toupper and tolower.

But in a post-Unicode world, it's a mistake to think of characters as a
primitive type, strings as a sequence of characters, and string functions
as defined by lifting character functions over the sequence.

(If I had my druthers, R6RS wouldn't have a character type at all.)
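
To make the "lifting" problem concrete, here is a small sketch in Python,
chosen only because its string methods follow the Unicode full case
mappings; lifted_lower is a hypothetical helper written for this example.
It suggests why a per-character function cannot simply be mapped over a
string: the result can depend on context and can change length.

```python
def lifted_lower(s):
    """Lower-case each character in isolation, then concatenate."""
    return "".join(ch.lower() for ch in s)

# Greek capitals omicron, delta, omicron, sigma ("road").
word = "\u039f\u0394\u039f\u03a3"

whole = word.lower()          # string-level: can see that sigma is word-final
lifted = lifted_lower(word)   # character-level: no context available

ends_whole = whole[-1]        # U+03C2, final small sigma
ends_lifted = lifted[-1]      # U+03C3, ordinary small sigma

# A case mapping may also change the length, so it is not char -> char.
upper_len = len("\u00df".upper())   # sharp s upcases to two letters
```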

> I would expect the stricmp-equivalent variants of string comparison to use
> the algorithm that characters use rather than the one it currently uses.

Nobody should use that algorithm, ever.

> In the end, are the existing functions *really* useful? Honestly, I can't
> think how. The string-upcase example is cute, but the case-insensitive
> comparison functions that use it are useless for any serious work. They
> have a semblance of locale awareness, but they aren't locale aware and
> that fact would show through rather quickly. In fact, it even shows
> through in case-folding: (string-downcase "STRASSE") ===> "strasse", not
> "straße". Whether that's right or wrong depends on where you are.

Expecting a mere computer program to get this right is unreasonable.
Does "POLISH STEEL" downcase to "polish steel" or "Polish steel"?
All existing systems will answer "polish", but that may not be The
Right Thing. In the pre-1996 German orthography (still in use by many
publishers), "Floß" at the beginning of a sentence could be the noun
"Floß" or the past-tense verb "floss". You have to know not only German
locale rules but German itself to figure out which.
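
A short sketch of the information loss involved, again in Python as an
illustration language only. Upcasing discards distinctions that no
downcasing rule can restore without knowing the language; the variable
names are invented for the example.

```python
original = "Stra\u00dfe"        # "Strasse" spelled with a sharp s
shouted = original.upper()      # the sharp s expands to "SS"
back = shouted.lower()          # nothing records that "SS" was one letter

round_trip_ok = back == original.lower()

# The same loss in English: the capital P that marked the nationality
# is gone, and a program cannot know whether to put it back.
steel = "Polish steel".upper().lower()
```

Here round_trip_ok comes out false, and steel is all lower case.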

In any event, case-folding is just one of the issues around collation,
where there are always three choices: use the raw numeric ordering,
use a locale-independent ordering that gets most things right and a few
things in each locale wrong, or use a fully localized ordering.

For handling human-readable text fully, you need fully localized routines,
but there are often technical reasons to do case-blind comparisons,
and for them a simple universal approach will work.
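
As a rough sketch of the first two choices and of a case-blind
comparison, in Python for illustration only. Sorting on a case-folded
key is a crude stand-in for a real locale-independent collation, not an
implementation of one, and same_ignoring_case is a hypothetical helper.

```python
# The second word is "Apfel" with A-umlaut (U+00C4) as its first letter.
words = ["zebra", "\u00c4pfel", "apple", "Zebra", "Banana"]

# Choice 1: raw numeric (code point) ordering; capitals sort first.
raw_order = sorted(words)

# Choice 2, approximated: a locale-independent key via Unicode case folding.
folded_order = sorted(words, key=str.casefold)

def same_ignoring_case(a, b):
    """Case-blind equality with no locale involved."""
    return a.casefold() == b.casefold()

match = same_ignoring_case("Strasse", "Stra\u00dfe")
```

Both orderings leave the A-umlaut word at the very end, which a German
reader would not expect. That is the sort of thing only the third choice,
backed by real locale data, can fix.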

> I think it would make more sense for R6RS to define a full set of truly
> locale-aware functions and place them in a separate (and optional)
> library.

I don't think locale-awareness belongs in the standard at all.
A SRFI is the proper place for it.

-- 
There is / One art                      John Cowan <cowan_at_ccil.org>
No more / No less                       http://www.ccil.org/~cowan
To do / All things
With art- / Lessness                     -- Piet Hein
Received on Wed Feb 14 2007 - 14:08:29 UTC
