[R6RS] Unicode scalar value escape sequences
Michael Sperber
sperber
Tue Mar 1 02:40:34 EST 2005
>>>>> "Marc" == Marc Feeley <feeley at IRO.UMontreal.CA> writes:
>> After thinking about it a little bit more, I like Marc's proposal for
>> allowing regular number literals to specify the scalar value of a
>> character literal better than the C/Java clone:
>>
>> #\n Unicode character n (n must start with a # character and it must
>> represent an exact integer, for example #\#x20 is the space character,
>> #\#d9 is the tab character, and #\#e1.2e2 is the lower case character
>> "x")
>>
>> Of course, the downside is that this doesn't carry over directly to
>> string literals. Marc, have you thought about allowing \#<n> in
>> string literals, and requiring some kind of delimiter (like ; or # or
>> whatever) after it?
Marc> Although Gambit has supported this notation for some time now, I'm not
Marc> convinced it is really the best approach. I think a syntax that is
Marc> shared by characters and strings would be better (and have a single
Marc> unified syntax). So if this is a valid string: "\u1234" then this
Marc> should be a valid character #\u1234 . A syntax like #\#d32, while
Marc> precise and flexible, is awkward if it can't be used in strings.
Yeah, but my question was what you thought about allowing that same
notation with a delimiter in strings. After all, \u sequences with
fewer than 4 digits must also be delimited.
Let me turn this into a proposal to be clear:
I propose using Gambit's notation for character literals that specify
a character through its scalar value.
I propose allowing
\<number>;
in string literals, where <number> must start with a # sign, to denote
a character through its scalar value. (Pick any other delimiter you
like, if it's only the semicolon that keeps you from liking the proposal.)
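As a rough illustration of the string half of the proposal, here is a minimal sketch of how a reader might expand \<number>; escapes. The procedure name expand-scalar-escapes is made up for this example and is not part of any report or implementation. It assumes that STRING->NUMBER accepts the same prefixed number syntax as the reader, and it passes any other backslash sequence through untouched.

```scheme
;; Hypothetical helper, for illustration only: expand the proposed
;; \<number>; escapes found in the raw text of a string literal.
(define (expand-scalar-escapes raw)
  (let loop ((chars (string->list raw)) (out '()))
    (cond
      ((null? chars) (list->string (reverse out)))
      ;; A backslash immediately followed by # starts an escape.
      ((and (char=? (car chars) #\\)
            (pair? (cdr chars))
            (char=? (cadr chars) #\#))
       ;; Collect everything up to the delimiting semicolon.
       (let scan ((rest (cdr chars)) (digits '()))
         (cond
           ((null? rest) (error "unterminated escape" raw))
           ((char=? (car rest) #\;)
            (let ((n (string->number (list->string (reverse digits)))))
              ;; <number> must denote an exact integer.
              (if (and n (exact? n) (integer? n))
                  (loop (cdr rest) (cons (integer->char n) out))
                  (error "not an exact integer" raw))))
           (else (scan (cdr rest) (cons (car rest) digits))))))
      ;; Any other character is copied unchanged.
      (else (loop (cdr chars) (cons (car chars) out))))))
```

With this sketch, (expand-scalar-escapes "a\\#x20;b") evaluates to "a b". The doubled backslash is only there because the raw text is itself written as a Scheme string. Whether INTEGER->CHAR rejects values that are not Unicode scalar values is left to the implementation here.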
Marc> Increasingly I believe it is wrong in a high-level language like
Marc> Scheme to have both a character type and a string type. The character
Marc> data type is really low-level, archaic and motivated by performance.
Marc> Ask yourself this question: if performance was not an issue would it
Marc> be possible to do text processing (elegantly) using only the string
Marc> data type and the following primitives?
Marc> (string-length str)
Marc> (substring str start end)
Marc> (string-append str...)
Marc> (string=? str...) ; and <, <=, ...
Marc> (char->integer str) ; where str is a string of length 1
Marc> (integer->char n) ; returns a string of length 1
Marc> (read-char [port]) ; returns a string of length 1
But you still retain CHAR->INTEGER, INTEGER->CHAR, and READ-CHAR.
(And thus, a lot of the current character predicates.) So the only
real difference in your proposal is that you'd have (string?
#\a) (or whatever the character notation is) return #t. Right?
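To make that reading of Marc's idea concrete, here is a minimal sketch in which a "character" is simply a string of length 1. The names string-char->integer, integer->string-char and count-occurrences are hypothetical. They are chosen so the standard CHAR->INTEGER and INTEGER->CHAR stay untouched, and they are layered on the existing character type only for the sake of a runnable example.

```scheme
;; Hypothetical stand-in for a string-based CHAR->INTEGER:
;; accepts a string of length 1 and returns its scalar value.
(define (string-char->integer str)
  (if (= (string-length str) 1)
      (char->integer (string-ref str 0))
      (error "expected a string of length 1" str)))

;; Hypothetical stand-in for a string-based INTEGER->CHAR:
;; returns a string of length 1.
(define (integer->string-char n)
  (string (integer->char n)))

;; Text processing with strings only: count how often the
;; length-1 string CH occurs in STR, using just STRING-LENGTH,
;; SUBSTRING and STRING=?.
(define (count-occurrences ch str)
  (let loop ((i 0) (count 0))
    (if (= i (string-length str))
        count
        (loop (+ i 1)
              (if (string=? (substring str i (+ i 1)) ch)
                  (+ count 1)
                  count)))))
```

Under these assumptions (string? (integer->string-char 97)) returns #t, (string-char->integer "a") returns 97, and (count-occurrences "a" "banana") returns 3.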
Marc> I wonder how novice users react when confronted with the two text
Marc> related datatypes in most current languages (strings and
Marc> characters).
I think characters and strings are intuitive concepts to most human
beings, not just novice Scheme users.
--
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla