[R6RS] R6RS Unicode SRFI controversial issues
Marc Feeley
feeley
Fri Jun 17 18:12:00 EDT 2005
I'm writing the R6RS Unicode SRFI and have encountered a few issues
which I think may be controversial (I won't mention the other less
controversial stuff to save bandwidth). Please give your opinion on
the following:
String literal escapes:
1) Matthew suggested using \<newline> (i.e. a backslash at the end of a
   line) in a string literal to indicate that the string continues on the
   next line. I believe Common Lisp has this too, and it ignores all the
   whitespace following the newline (so that strings can be indented).
   Should R6RS do the same? Moreover, should R6RS prohibit newlines in
   strings that are not preceded by a backslash? My position is yes on
   both questions.
2) Matthew suggested using the \u<x>...<x> (with <= 4 hex digits <x>) and
   \U<x>...<x> (with <= 6 hex digits <x>) escapes. The \u<x>...<x> escape
   is similar to, but not exactly the same as, Java's. Java requires
   exactly 4 hex digits. Moreover, Java transforms the \u<x>...<x> escapes
   to Unicode characters before lexical analysis (so that "!\u0022 is a
   valid string since \u0022 represents the closing doublequote). I propose
   that the \u<x>...<x> escape require exactly 4 hex digits and that the
   handling of \u<x>...<x> escapes be done as part of the string parser,
   i.e. "!\u0022 is not a valid string, but "!\u0022" is equivalent to
   "!\"". For consistency, I propose that the \U<x>...<x> escape require
   exactly 6 hex digits. That makes the syntax easy to remember:
   \<o><o><o>            : range 0 to #xFF (as in C)
   \x<x><x>              : range 0 to #xFF (as in C)
   \u<x><x><x><x>        : range 0 to #xFFFF excluding #xD800..#xDFFF
   \U<x><x><x><x><x><x>  : range 0 to #x10FFFF excluding #xD800..#xDFFF
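
To make the two string proposals concrete, here is a minimal sketch of a string-literal reader, written in Python purely for illustration. The names read_string, StringSyntaxError and ESCAPES are hypothetical. Two details are assumptions rather than part of the proposal: only space and tab are skipped after \<newline>, and of the other (uncontroversial) escapes only \" and \\ are handled.

```python
import string

HEX = set(string.hexdigits)
OCT = set("01234567")

# Escape letter -> (exact number of hex digits, largest allowed value).
ESCAPES = {"x": (2, 0xFF), "u": (4, 0xFFFF), "U": (6, 0x10FFFF)}


class StringSyntaxError(ValueError):
    """Hypothetical error type for literals the proposed rules reject."""


def scalar(code, limit):
    # Reject values above the escape's range and the surrogate block.
    if code > limit or 0xD800 <= code <= 0xDFFF:
        raise StringSyntaxError(f"code point out of range: #x{code:X}")
    return chr(code)


def read_string(src):
    # Parse one string literal at the start of src and return
    # (value, index just past the closing doublequote).
    if not src.startswith('"'):
        raise StringSyntaxError("expected opening doublequote")
    out, i, n = [], 1, len(src)
    while i < n:
        c = src[i]
        if c == '"':
            return "".join(out), i + 1
        if c == "\n":
            # Item 1: a newline not preceded by a backslash is an error.
            raise StringSyntaxError("newline not preceded by backslash")
        if c != "\\":
            out.append(c)
            i += 1
            continue
        i += 1  # step over the backslash
        if i >= n:
            break
        e = src[i]
        if e == "\n":
            # Item 1: drop the newline and the indentation that follows it
            # (assumption: "whitespace" here means spaces and tabs).
            i += 1
            while i < n and src[i] in " \t":
                i += 1
        elif e in ESCAPES:
            # Item 2: exactly 2, 4 or 6 hex digits, decoded by the string
            # parser itself, so an escaped doublequote never ends the string.
            width, limit = ESCAPES[e]
            digits = src[i + 1 : i + 1 + width]
            if len(digits) != width or not set(digits) <= HEX:
                raise StringSyntaxError(f"expected {width} hex digits")
            out.append(scalar(int(digits, 16), limit))
            i += 1 + width
        elif e in OCT:
            # Exactly three octal digits, value limited to #xFF.
            digits = src[i : i + 3]
            if len(digits) != 3 or not set(digits) <= OCT:
                raise StringSyntaxError("expected 3 octal digits")
            out.append(scalar(int(digits, 8), 0xFF))
            i += 3
        elif e in '"\\':
            out.append(e)
            i += 1
        else:
            raise StringSyntaxError(f"unknown escape: \\{e}")
    raise StringSyntaxError("unterminated string literal")
```

Under these rules the source text "!\u0022" reads as the two-character string !" while "!\u0022 (without a closing doublequote) is rejected as unterminated, which is the opposite of what Java's pre-lexical translation would do.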
Character literals:
1) #\newline is defined as the Unicode character 10, which has
   traditionally been called linefeed. I suggest we add #\linefeed and that
   (char->integer #\newline) = (char->integer #\linefeed) = 10.
2) For consistency with the string literal escapes, the character literal
   syntax in R6RS should support a 2/4/6 digit hexadecimal notation:

   #\x<x><x>              : range 0 to #xFF
   #\u<x><x><x><x>        : range 0 to #xFFFF excluding #xD800..#xDFFF
   #\U<x><x><x><x><x><x>  : range 0 to #x10FFFF excluding #xD800..#xDFFF

   The octal notation should not be supported because #\0 to #\7 would
   be ambiguous (and making a special case for the single digit case
   would be ugly and error prone).
3) The named characters should be followed by a delimiter, so that
   the datum (#\spaceous) is an error instead of being equivalent to
   the two element list (#\space ous) as in R5RS.
4) For consistency with the case sensitivity of symbols (and the fact that
   case is significant to distinguish #\u... and #\U...), the named
   characters should also be case sensitive. #\Space should be an error.
   By the same reasoning, booleans should also be case sensitive, so that
   #F and #T are errors.
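
The four character-literal points can be sketched together as a small reader, again in Python and only as an illustration. The names read_char, read_boolean, CharSyntaxError, NAMED and DELIMITERS are hypothetical, the name table lists only the names mentioned in this message, and the delimiter set (whitespace, parentheses, doublequote, semicolon) is an assumption.

```python
import string

HEX = set(string.hexdigits)

# Item 1: #\newline and #\linefeed both denote code point 10.
NAMED = {"newline": 10, "linefeed": 10, "space": 32}

# Item 2: prefix letter -> (exact number of hex digits, largest value).
HEX_FORMS = {"x": (2, 0xFF), "u": (4, 0xFFFF), "U": (6, 0x10FFFF)}

# Assumed delimiter set for ending a character token.
DELIMITERS = set(" \t\n()\";")


class CharSyntaxError(ValueError):
    """Hypothetical error type for literals the proposed rules reject."""


def read_char(src):
    # Parse one character literal at the start of src and return
    # (character, index just past the literal).
    if not src.startswith("#\\") or len(src) < 3:
        raise CharSyntaxError("expected #\\ followed by a character")
    # The first character is always part of the token (so #\( works);
    # item 3: the token then extends up to the next delimiter.
    end = 3
    while end < len(src) and src[end] not in DELIMITERS:
        end += 1
    token = src[2:end]
    if len(token) == 1:
        return token, end
    # Item 4: the lookup is case sensitive, so "Space" is not found.
    if token in NAMED:
        return chr(NAMED[token]), end
    form = HEX_FORMS.get(token[0])
    digits = token[1:]
    if form and len(digits) == form[0] and set(digits) <= HEX:
        code = int(digits, 16)
        if code <= form[1] and not 0xD800 <= code <= 0xDFFF:
            return chr(code), end
    raise CharSyntaxError(f"bad character literal: #\\{token}")


def read_boolean(token):
    # Item 4: only the lowercase spellings are accepted.
    if token == "#t":
        return True
    if token == "#f":
        return False
    raise CharSyntaxError(f"bad boolean: {token}")
```

With this reader #\space followed by a blank yields the space character, while #\spaceous, #\Space and #T are all rejected. A multi-digit token such as #\07 is also rejected, since no octal form exists; a single digit like #\0 is simply the character 0.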
Marc