[R6RS] R6RS Unicode SRFI controversial issues

Fri Jun 17 18:12:00 EDT 2005

I'm writing the R6RS Unicode SRFI and have encountered a few issues  
which I think may be controversial (I won't mention the other less  
controversial stuff to save bandwidth).  Please give your opinion on  
the following:

   String literal escapes:

     1) Matthew suggested using \<newline> (i.e. a backslash at the end
        of a line) in a string literal to indicate that the string
        continues on the next line.  I believe Common Lisp has this too,
        and it ignores all the whitespace following the newline (so that
        strings can be indented).  Should R6RS do the same?  Moreover,
        should R6RS prohibit newlines in strings that are not preceded
        by a backslash?  My position is yes on both questions.

     2) Matthew suggested using the \u<x>...<x> (with <= 4 hex digits
        <x>) and \U<x>...<x> (with <= 6 hex digits <x>) escapes.  The
        \u<x>...<x> escape is similar but not exactly the same as
        Java's.  Java requires exactly 4 hex digits.  Moreover, Java
        transforms the \u<x>...<x> escapes to Unicode characters before
        lexical analysis (so that "!\u0022 is a valid string since
        \u0022 represents the closing doublequote).  I propose that the
        \u<x>...<x> escape require exactly 4 hex digits and that the
        handling of \u<x>...<x> escapes be done as part of the string
        parser, i.e. "!\u0022 is not a valid string, but "!\u0022" is
        equivalent to "!\"".  For consistency, I propose that the
        \U<x>...<x> escape require exactly 6 hex digits.  That makes the
        syntax easy to remember:

         \<o><o><o>           : range 0 to #xFF (as in C)
         \x<x><x>             : range 0 to #xFF (as in C)
         \u<x><x><x><x>       : range 0 to #xFFFF excluding #xD800..#xDFFF
         \U<x><x><x><x><x><x> : range 0 to #x10FFFF excluding #xD800..#xDFFF
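
   To make the string literal proposals above concrete, here is a
   minimal sketch of a reader that applies them.  It is written in
   Python rather than Scheme only so that it can be run as-is.  The
   names read_string_literal and EscapeError are invented for this
   illustration, and skipping only spaces and tabs after a \<newline>
   is an assumption (the text above just says "whitespace").  It is a
   sketch under those assumptions, not a reference implementation.

```python
# Illustrative sketch; read_string_literal and EscapeError are hypothetical names.
HEX = "0123456789abcdefABCDEF"
OCT = "01234567"
# escape letter -> (exact number of hex digits, largest allowed code point)
HEX_ESCAPES = {"x": (2, 0xFF), "u": (4, 0xFFFF), "U": (6, 0x10FFFF)}


class EscapeError(ValueError):
    """Raised when a string literal violates the proposed syntax."""


def _fixed_digits(src, i, count, alphabet, base):
    """Read exactly `count` digits at src[i:]; return (value, next index)."""
    digits = src[i:i + count]
    if len(digits) != count or any(d not in alphabet for d in digits):
        raise EscapeError(f"expected exactly {count} digits at offset {i}")
    return int(digits, base), i + count


def read_string_literal(src):
    """Parse a literal that starts at src[0]; return (value, next index)."""
    if not src.startswith('"'):
        raise EscapeError("string literal must start with a doublequote")
    out, i = [], 1
    while i < len(src):
        c = src[i]
        if c == '"':                       # closing doublequote
            return "".join(out), i + 1
        if c == "\n":                      # raw newline: prohibited
            raise EscapeError("newline must be preceded by a backslash")
        if c != "\\":                      # ordinary character
            out.append(c)
            i += 1
            continue
        i += 1                             # skip the backslash
        if i >= len(src):
            break
        e = src[i]
        if e == "\n":                      # \<newline>: line continuation
            i += 1
            # Assumption: only spaces and tabs count as indentation here.
            while i < len(src) and src[i] in " \t":
                i += 1
            continue
        if e in '"\\':                     # \" and \\ stand for themselves
            out.append(e)
            i += 1
            continue
        if e in OCT:                       # \<o><o><o>
            code, i = _fixed_digits(src, i, 3, OCT, 8)
            limit = 0xFF
        elif e in HEX_ESCAPES:             # \x, \u, \U with fixed widths
            count, limit = HEX_ESCAPES[e]
            code, i = _fixed_digits(src, i + 1, count, HEX, 16)
        else:
            raise EscapeError(f"unknown escape \\{e}")
        if code > limit or 0xD800 <= code <= 0xDFFF:
            raise EscapeError(f"code point #x{code:X} is out of range")
        out.append(chr(code))
    raise EscapeError("unterminated string literal")
```

   With this sketch, the input "!\u0022" reads as the two-character
   string !" because the escape is decoded inside the string parser,
   while "!\u0022 (without the final doublequote) is rejected as
   unterminated.  A string containing a raw newline is rejected, and
   one containing \<newline> followed by indentation reads as if the
   two lines were joined.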

   Character literals:

     1) #\newline is defined as the Unicode character 10, which has
        traditionally been called linefeed.  I suggest we add
        #\linefeed and that
        (char->integer #\newline) = (char->integer #\linefeed) = 10.

     2) For consistency with the string literal escapes the character
        literal syntax in R6RS should support a 2/4/6 digit hexadecimal
        notation:

         #\x<x><x>             : range 0 to #xFF
         #\u<x><x><x><x>       : range 0 to #xFFFF excluding #xD800..#xDFFF
         #\U<x><x><x><x><x><x> : range 0 to #x10FFFF excluding #xD800..#xDFFF

        The octal notation should not be supported because #\0 to #\7
        would be ambiguous (and making a special case for the single
        digit case would be ugly and error prone).

     3) The named characters should be followed by a delimiter, so that
        the datum (#\spaceous) is an error instead of being equivalent
        to the two element list (#\space ous) as in R5RS.

     4) For consistency with the case sensitivity of symbols (and the
        fact that case is significant to distinguish #\u... and
        #\U...), the named characters should also be case sensitive.
        #\Space should be an error.  By the same reasoning, booleans
        should also be case sensitive, so that #F and #T are errors.
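
   The character literal rules above (fixed-width hex notations, a
   mandatory delimiter after the literal, and case-sensitive names) can
   be sketched the same way.  Again this is Python, used only so the
   example runs as-is.  The names read_char_literal and CharSyntaxError
   are invented for this illustration, and the delimiter set
   (whitespace, parentheses, doublequote, semicolon) is an assumption
   borrowed from R5RS rather than something settled above.

```python
# Illustrative sketch; read_char_literal and CharSyntaxError are hypothetical names.
DELIMITERS = set(" \t\n()\";")            # assumption: R5RS-style delimiters
HEX_DIGITS = set("0123456789abcdefABCDEF")
# Exact-match (case-sensitive) names; #\linefeed is the proposed addition.
NAMED = {"space": " ", "newline": "\n", "linefeed": "\n"}
# prefix letter -> (exact number of hex digits, largest allowed code point)
HEX_FORMS = {"x": (2, 0xFF), "u": (4, 0xFFFF), "U": (6, 0x10FFFF)}


class CharSyntaxError(ValueError):
    """Raised when a character literal violates the proposed syntax."""


def read_char_literal(src, start=0):
    """Parse a literal at src[start:]; return (character, next index)."""
    if src[start:start + 2] != "#\\":
        raise CharSyntaxError("character literal must start with #\\")
    i = start + 2
    if i >= len(src):
        raise CharSyntaxError("missing character after #\\")
    # The first character always belongs to the literal, even if it is
    # itself a delimiter; after that, read up to the next delimiter.
    end = i + 1
    while end < len(src) and src[end] not in DELIMITERS:
        end += 1
    body = src[i:end]
    if len(body) == 1:                    # #\a, #\x, #\7, ...
        return body, end
    if body in NAMED:                     # #\space but not #\Space
        return NAMED[body], end
    form = HEX_FORMS.get(body[0])
    if form is not None:
        count, limit = form
        digits = body[1:]
        if len(digits) == count and all(d in HEX_DIGITS for d in digits):
            code = int(digits, 16)
            if code > limit or 0xD800 <= code <= 0xDFFF:
                raise CharSyntaxError(f"code point #x{code:X} is out of range")
            return chr(code), end
    raise CharSyntaxError(f"unknown character literal #\\{body}")
```

   In this sketch #\spaceous and #\Space are both rejected, #\newline
   and #\linefeed both yield character 10, and #\7 is simply the digit
   7 because no octal form exists to compete with it.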

Marc