[R6RS] Unicode SRFI - responses needed
Matthew Flatt
mflatt
Tue Jul 19 12:12:57 EDT 2005
Here's a summary of the SRFI feedback so far, from my perspective:
* Jorgen Schaefer points out that Unicode defines a case-folding
mapping, which we should use for -ci operations instead of
downcasing. I think he's clearly right; I overlooked this mapping
before.
* Alex Shinn and others have convinced me that the SRFI should include
`string-upcase', `string-downcase', and `string-titlecase', which are
locale-independent but use the mappings in "SpecialCasing.txt" to
handle a few conversions that are not 1-to-1 in scalar values.
Along the same lines, the string -ci operations should incorporate
the non-1-to-1 mappings in CaseFolding.txt. (This file provides
specific information to use for both 1-to-1 mappings and non-1-to-1
mappings.) After case-folding, the string comparisons should proceed
by comparing characters (i.e., scalar values); they should not use
the Unicode collation algorithm.
Finally, we probably want case-folding operations `char-foldcase'
and `string-foldcase'.
All of this, to me, strikes the right balance between usefulness and
ease-of-implementation.
* See
http://srfi.schemers.org/srfi-75/mail-archive/msg00084.html
where I attempt to reply to most other messages that suggest more
exotic definitions of "character", a weaker definition of
"character", or a different set of core operations.
The message content is merely my opinion, but I did my best to
reflect the opinion of the editors as a group. Whether I got it
right is something we should discuss further.
An ongoing point of discussion among a handful of people (not
including me) is whether R6RS should include any character
comparison or conversion operations. I still think it should, but we
should discuss this specifically before putting out a new draft.
* The reception for here strings is mixed. I would be happy to see
them go, at this point, just to keep things simpler. In any case,
I'd like to get a sense of the editors' opinion before producing a
second draft.
* An open question: are Scheme implementations required to support all
Unicode scalar values, or are subsets ok? I think we discussed this,
and we noted that the set of characters that fit into the 16-bit
space is closed under various operations (I'll double-check this).
Few other natural subsets are closed, apparently.
I'm inclined to require the full set of Unicode characters, and let
implementations declare that they deviate from the standard when
they support only subsets. That way, a library implementor can say
"this library works in Scheme", instead of "this library works in
Scheme that supports all Unicode characters". Meanwhile, other
libraries might be annotated "this library works even with
variations of Scheme that support only ASCII characters" when the
library implementor cares and has given the question some thought.
This question is closely related to whether R6RS nails down the
definition of character and supplies various operations at all. Many
appeal to the way that numbers are handled in R5RS to support the
idea that R6RS's requirements should be minimal. Since I see our
role as strengthening portability wherever possible, I don't agree
with this line of reasoning.
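
To make the case-folding and non-1-to-1 mapping points above concrete,
here is a minimal sketch. It uses Python's built-in Unicode tables as a
stand-in, not any Scheme implementation; the names `string_ci_equal' and
`string_ci_less' are hypothetical, chosen only for illustration. It
assumes that Python's `str.upper' and `str.casefold' apply the full
(multi-character) mappings, which is how they are documented to behave.

```python
# Illustration only: Python's str methods stand in for the proposed
# string-upcase / string-foldcase operations. The helper names below
# are hypothetical and do not come from the SRFI.

def string_ci_equal(a: str, b: str) -> bool:
    """Case-insensitive equality: fold both strings, then compare."""
    return a.casefold() == b.casefold()


def string_ci_less(a: str, b: str) -> bool:
    """Case-insensitive ordering: fold, then compare code point by
    code point (no Unicode collation algorithm involved)."""
    return a.casefold() < b.casefold()


if __name__ == "__main__":
    word = "Stra\u00dfe"  # "Strasse" spelled with U+00DF (sharp s)

    # Upcasing is not 1-to-1: the sharp s becomes two characters.
    print(word.upper())                 # STRASSE
    print(len(word), len(word.upper()))

    # Downcasing leaves the sharp s alone, so comparing by downcasing
    # treats the two spellings as different; folding does not.
    print(word.lower() == "STRASSE".lower())      # False
    print(string_ci_equal(word, "STRASSE"))       # True

    # After folding, ordering is by scalar value.
    print(string_ci_less("a", "B"))               # True
```

The design point is that the -ci comparison is just "fold, then compare
scalar values", so it stays locale-independent and cheap to implement.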
Response items:
* Does anyone doubt that we really want to pin down the definition of
character as "Unicode scalar value"? (I still don't.)
* Does anyone want to argue that supporting a subset of Unicode might
count as standard-compliant? (I think that it's not necessary to
allow this in the standard.)
* Is anyone unhappy with slightly more complex string operations that
take into account non-1-to-1 conversions? (I think I'm happy with
this, and I'll implement it today to be sure.)
* Who wants to keep character-based comparison and conversion
operations? (I do.)
* How many editors want to keep here strings? How many would prefer to
see them go? (I'm now inclined to get rid of them.)
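
For the first two response items, a small sketch may help pin down what
is being asked. It is written in Python purely for illustration; the
names `is_scalar_value' and `bmp_escapes' are hypothetical. The first
function states the usual definition of a Unicode scalar value (any code
point except the surrogate range). The second is one possible way to
check whether the 16-bit subset is closed under a given string mapping;
its result depends on the Unicode tables of the Python build that runs
it, so no outcome is claimed here.

```python
# Illustration only; helper names are hypothetical.

def is_scalar_value(n: int) -> bool:
    """True if n is a Unicode scalar value: a code point in
    0..0x10FFFF that is not a surrogate (0xD800..0xDFFF)."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)


def bmp_escapes(mapping):
    """Return the 16-bit characters c for which mapping(c) contains a
    character outside the 16-bit range. An empty list means the 16-bit
    subset is closed under that mapping."""
    escaped = []
    for n in range(0x10000):
        if not is_scalar_value(n):
            continue  # skip surrogates; they are not characters
        c = chr(n)
        if any(ord(r) > 0xFFFF for r in mapping(c)):
            escaped.append(c)
    return escaped


if __name__ == "__main__":
    for name, fn in [("upcase", str.upper),
                     ("downcase", str.lower),
                     ("foldcase", str.casefold)]:
        print(name, "escapes:", len(bmp_escapes(fn)))
```
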
Thanks,
Matthew