Lexical syntax and read syntax

The syntax of Scheme code is organized in three levels:

  1. the lexical syntax that describes how a program text is split into a sequence of lexemes,

  2. the read syntax, formulated in terms of the lexical syntax, that structures the lexeme sequence as a sequence of syntactic data, where a syntactic datum is a recursively structured entity,

  3. the program syntax formulated in terms of the read syntax, imposing further structure and assigning meaning to syntactic data.

Syntactic data (also called external representations) double as a notation for data, and Scheme’s (rnrs io ports (6)) library (library section on “Port I/O”) provides the get-datum and put-datum procedures for reading and writing syntactic data, converting between their textual representation and the corresponding values. Each syntactic datum uniquely determines a corresponding datum value. A syntactic datum can be used in a program to obtain the corresponding datum value using quote (see section 9.5.1).

Scheme source code consists of syntactic data and (non-significant) comments. Syntactic data in Scheme source code are called forms. Consequently, Scheme’s syntax has the property that any sequence of characters that is a form is also a syntactic datum representing some object. This can lead to confusion, since it may not be obvious out of context whether a given sequence of characters is intended to denote data or program. It is also a source of power, since it facilitates writing programs such as interpreters and compilers that treat programs as data (or vice versa). A form nested inside another form is called a subform.

A datum value may have several different external representations. For example, both “#e28.000” and “#x1c” are syntactic data representing the exact integer object 28, and the syntactic data “(8 13)”, “( 08 13 )”, “(8 . (13 . ()))” all represent a list containing the exact integer objects 8 and 13. Syntactic data that denote equal objects (in the sense of equal?; see section 9.6) are always equivalent as forms of a program.

Because of the close correspondence between syntactic data and datum values, this report sometimes uses the term datum to denote either a syntactic datum or a datum value when the exact meaning is apparent from the context.

An implementation is not permitted to extend the lexical or read syntax in any way, with one exception: it need not treat the syntax #!<identifier>, for any <identifier> (see section 3.2.4) that is not r6rs, as a syntax violation, and it may use specific #!-prefixed identifiers as flags indicating that subsequent input contains extensions to the standard lexical or read syntax. The syntax #!r6rs may be used to signify that the input afterward is written with the lexical syntax and read syntax described by this report when no other #!<identifier> appears; #!r6rs is otherwise treated as a comment; see section 3.2.3.

This chapter overviews and provides formal accounts of the lexical syntax and the read syntax.

3.1  Notation

The formal syntax for Scheme is written in an extended BNF. Non-terminals are written using angle brackets. Case is insignificant for non-terminal names.

All spaces in the grammar are for legibility. <Empty> stands for the empty string.

The following extensions to BNF are used to make the description more concise: <thing>* means zero or more occurrences of <thing>, and <thing>+ means at least one <thing>.

Some non-terminal names refer to the Unicode scalar values of the same name: <character tabulation> (U+0009), <linefeed> (U+000A), <carriage return> (U+000D), <line tabulation> (U+000B), <form feed> (U+000C), <carriage return> (U+000D), <space> (U+0020), <next line> (U+0085), <line separator> (U+2028), and <paragraph separator> (U+2029).

3.2  Lexical syntax

The lexical syntax determines how a character sequence is split into a sequence of lexemes, omitting non-significant portions such as comments and whitespace. The character sequence is assumed to be text according to the Unicode standard [27]. Some of the lexemes, such as number representations, identifiers, strings etc., of the lexical syntax are syntactic data in the read syntax, and thus represent data. Besides the formal account of the syntax, this section also describes what datum values are denoted by these syntactic data.

The lexical syntax, in the description of comments, contains a forward reference to <datum>, which is described as part of the read syntax. Being comments, however, these <datum>s do not play a significant role in the syntax.

Case is significant except in boolean data, number representations, and hexadecimal numbers denoting Unicode scalar values. For example, #x1A and #X1a are equivalent. The identifier Foo is, however, distinct from the identifier FOO.

3.2.1  Formal account

<Interlexeme space> may occur on either side of any lexeme, but not within a lexeme.

Identifiers, number representations, characters, booleans, and dot must be terminated by a <delimiter> (e.g., parenthesis, space, or comment) or by the end of the input.

The following two characters are reserved for future extensions to the language: { }

<lexeme> → <identifier> | <boolean> | <number>
         | <character> | <string>
         | ( | ) | [ | ] | #( | #vu8( | | | , | ,@ | .
         | # | # | #, | #,@
<delimiter> → <interlexeme space> | ( | ) | [ | ] | " | ;
<whitespace> → <character tabulation>
         | <linefeed> | <line tabulation> | <form feed>
         | <carriage return> | <next line>
         | <any character whose category is Zs, Zl, or Zp>
<line ending> → <linefeed> | <carriage return>
         | <carriage return> <linefeed> | <next line>
         | <carriage return> <next line> | <line separator>
<comment> → ; ⟨all subsequent characters up to a
         <line ending> or <paragraph separator>⟩
         | <nested comment>
         | #; <interlexeme space> <datum>
         | #!r6rs
<nested comment> → #| <comment text>
         <comment cont>* |#
<comment text> → ⟨character sequence not containing
         #| or |#
<comment cont> → <nested comment> <comment text>
<atmosphere> → <whitespace> | <comment>
<interlexeme space> → <atmosphere>*

<identifier> → <initial> <subsequent>*
         | <peculiar identifier>
<initial> → <constituent> | <special initial>
         | <inline hex escape>
<letter> → a | b | c | ... | z
         | A | B | C | ... | Z
<constituent> → <letter>
         | ⟨any character whose Unicode scalar value is greater than
             127, and whose category is Lu, Ll, Lt, Lm, Lo, Mn,
             Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co⟩
<special initial> → ! | $ | % | & | * | / | : | < | =
         | > | ? | ^ | _ | ~
<subsequent> → <initial> | <digit>
         | <any character whose category is Nd, Mc, or Me>
         | <special subsequent>
<digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<hex digit> → <digit>
         | a | A | b | B | c | C | d | D | e | E | f | F
<special subsequent> → + | - | . | @
<inline hex escape> → \x<hex scalar value>;
<hex scalar value> → <hex digit>+
<peculiar identifier> → + | - | ... | -> <subsequent>*
<boolean> → #t | #T | #f | #F
<character> → #\<any character>
         | #\<character name>
         | #\x<hex scalar value>
<character name> → nul | alarm | backspace | tab
         | linefeed | newline | vtab | page | return
         | esc | space | delete
<string> → " <string element>* "
<string element> → <any character other than " or \>
         | \a | \b | \t | \n | \v | \f | \r
         | \" | \\
         | \<line ending> | \<space>
         | <inline hex escape>

A <hex scalar value> represents a Unicode scalar value between 0 and #x10FFFF, excluding the range [#xD800, #xDFFF].

The rules for <num R>, <complex R>, <real R>, <ureal R>, <uinteger R>, and <prefix R> below should be replicated for R = 2, 8, 10, and 16. There are no rules for <decimal 2>, <decimal 8>, and <decimal 16>, which means that number representations containing decimal points or exponents must be in decimal radix.

<number> → <num 2> | <num 8>
         | <num 10> | <num 16>
<num R> → <prefix R> <complex R>
<complex R> → <real R> | <real R> @ <real R>
         | <real R> + <ureal R> i | <real R> - <ureal R> i
         | <real R> + <naninf> i | <real R> - <naninf> i
         | <real R> + i | <real R> - i
         | + <ureal R> i | - <ureal R> i
         | + <naninf> i | - <naninf> i
         | + i | - i
<real R> → <sign> <ureal R>
         | + <naninf> | - <naninf>
<naninf> → nan.0 | inf.0
<ureal R> → <uinteger R>
         | <uinteger R> / <uinteger R>
         | <decimal R> <mantissa width>
<decimal 10> → <uinteger 10> <suffix>
         | . <digit 10>+ #* <suffix>
         | <digit 10>+ . <digit 10>* #* <suffix>
         | <digit 10>+ #+ . #* <suffix>
<uinteger R> → <digit R>+ #*
<prefix R> → <radix R> <exactness>
         | <exactness> <radix R>

<suffix> → <empty>
         | <exponent marker> <sign> <digit 10>+
<exponent marker> → e | E | s | S | f | F
         | d | D | l | L
<mantissa width> → <empty>
         | | <digit 10>+
<sign> → <empty> | + | -
<exactness> → <empty>
         | #i| #I | #e| #E
<radix 2> → #b| #B
<radix 8> → #o| #O
<radix 10> → <empty> | #d | #D
<radix 16> → #x| #X
<digit 2> → 0 | 1
<digit 8> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
<digit 10> → <digit>
<digit 16> → <hex digit>

3.2.2  Line endings

Line endings are significant in Scheme in single-line comments (see section 3.2.3) and within string literals. In Scheme source code, any of the line endings in <line ending> marks the end of a line. Moreover, the two-character line endings <carriage return> <linefeed> and <carriage return> <next line> each count as a single line ending.

In a string literal, a line ending not preceded by a \ denotes a linefeed character, which is the standard line-ending character of Scheme.

3.2.3  Whitespace and comments

Whitespace characters are spaces, linefeeds, carriage returns, character tabulations, form feeds, line tabulations, and any other character whose category is Zs, Zl, or Zp. Whitespace is used for improved readability and as necessary to separate lexemes from each other. Whitespace may occur between any two lexemes, but not within a lexeme. Whitespace may also occur inside a string, where it is significant.

The lexical syntax includes several comment forms. In all cases, comments are invisible to Scheme, except that they act as delimiters, so, for example, a comment cannot appear in the middle of an identifier or number representation.

A semicolon (;) indicates the start of a line comment.The comment continues to the end of the line on which the semicolon appears.

Another way to indicate a comment is to prefix a <datum> (cf. section 3.3.1) with #;, possibly with <interlexeme space> before the <datum>. The comment consists of the comment prefix #; and the <datum> together. This notation is useful for “commenting out” sections of code.

Block comments may be indicated with properly nested #|and |# pairs.

#|
   The FACT procedure computes the factorial
   of a non-negative integer.
|#
(define fact
  (lambda (n)
    ;; base case
    (if (= n 0)
        #;(= n 1)
        1       ; identity of *
        (* n (fact (- n 1))))))

The lexeme #!r6rs, which signifies that the program text that follows is written with the lexical and read syntax described in this report, is also otherwise treated as a comment.

3.2.4  Identifiers

Most identifiersallowed by other programming languages are also acceptable to Scheme. In general, a sequence of letters, digits, and “extended alphabetic characters” is an identifier when it begins with a character that cannot begin a number representation. In addition, +, -, and ... are identifiers, as is a sequence of letters, digits, and extended alphabetic characters that begins with the two-character sequence ->. Here are some examples of identifiers:

lambda         q                soup
list->vector   +                V17a
<=             a34kTMNs         ->-
the-word-recursion-has-many-meanings

Extended alphabetic characters may be used within identifiers as if they were letters. The following are extended alphabetic characters:

! $ % & * + - . / : < = > ? @ ^ _ ~ 

Moreover, all characters whose Unicode scalar values are greater than 127 and whose Unicode category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co can be used within identifiers. In addtion, any character can be used within an identifier when denoted via an <inline hex escape>. For example, the identifier H\x65;llo is the same as the identifier Hello, and the identifier \x3BB; is the same as the identifier λ.

Any identifier may be used as a variableor as a syntactic keyword(see sections 4.2 and 6.3.2) in a Scheme program. Any identifier may also be used as a syntactic datum, in which case it denotes a symbol(see section 9.11).

3.2.5  Booleans

The standard boolean objects for true and false are written as #t and #f.

3.2.6  Characters

Characters are written using the notation #\<character>or #\<character name> or #\x<hex scalar value>.

For example:

#\a lower case letter a
#\A upper case letter A
#\( left parenthesis
#\ space character
#\nul U+0000
#\alarm U+0007
#\backspace U+0008
#\tab U+0009
#\linefeed U+000A
#\newline U+000A
#\vtab U+000B
#\page U+000C
#\return U+000D
#\esc U+001B
#\space U+0020
preferred way to write a space
#\delete U+007F
#\xFF U+00FF
#\x03BB U+03BB
#\x00006587 U+6587
#\λ U+03BB
#\x0001z &lexical exception
#\λx &lexical exception
#\alarmx &lexical exception
#\alarm x U+0007
followed by x
#\Alarm &lexical exception
#\alert &lexical exception
#\xA U+000A
#\xFF U+00FF
#\xff U+00FF
#\x ff U+0078
followed by another datum, ff
#\x(ff) U+0078
followed by another datum,
a parenthesized ff
#\(x) &lexical exception
#\(x &lexical exception
#\((x) U+0028
followed by another datum,
parenthesized x
#\x00110000 &lexical exception
out of range
#\x000000001 U+0001
#\xD800 &lexical exception
in excluded range

(The notation &lexical exception means that the line in question is a lexical syntax violation.)

Case is significant in #\<character>, and in #\⟨character name⟩, but not in #\x<hex scalar value>. A <character> must be followed by a <delimiter> or by the end of the input. This rule resolves various ambiguous cases involving named characters, requiring, for example, the sequence of characters “#\space” to be interpreted as the space character rather than as the character “#\s” followed by the identifier “pace”.

Note:   The #\newline notation is retained for backward compatibility. Its use is deprecated; #\linefeed should be used instead.

3.2.7  Strings

String are written as sequences of characters enclosed within doublequotes ("). Within a string literal, various escape sequencesdenote characters other than themselves. Escape sequences always start with a backslash (\):

These escape sequences are case-sensitive, except that the alphabetic digits of a <hex scalar value> can be uppercase or lowercase.

Any other character in a string after a backslash is an error. Except for a line ending, any character outside of an escape sequence and not a doublequote stands for itself in the string literal. For example the single-character string "λ" (doublequote, a lower case lambda, doublequote) denotes the same string literal as "\x03bb;". A line ending stands for a linefeed character.

Examples:

"abc" U+0061, U+0062, U+0063
"\x41;bc" "Abc" ; U+0041, U+0062, U+0063
"\x41; bc" "A bc"
U+0041, U+0020, U+0062, U+0063
"\x41bc;" U+41BC
"\x41" &lexical exception
"\x;" &lexical exception
"\x41bx;" &lexical exception
"\x00000041;" "A" ; U+0041
"\x0010FFFF;" U+10FFFF
"\x00110000;" &lexical exception
out of range
"\x000000001;" U+0001
"\xD800;" &lexical exception
in excluded range
"A
bc" U+0041, U+000A, U+0062, U+0063
if no space occurs after the A

3.2.8  Numbers

The syntax of written representations for number objects is described formally by the <number> rule in the formal grammar. Case is not significant in numerical constants.

A number representation may be written in binary, octal, decimal, or hexadecimal by the use of a radix prefix. The radix prefixes are #b(binary), #o(octal), #d(decimal), and #x(hexadecimal). With no radix prefix, a number representation is assumed to be expressed in decimal.

A numerical constant may be specified to be either exact or inexact by a prefix. The prefixes are #efor exact, and #ifor inexact. An exactness prefix may appear before or after any radix prefix that is used. If the written representation of a number has no exactness prefix, the constant is inexact if it contains a decimal point, an exponent, a “#” character in the place of a digit, or a nonempty mantissa width; otherwise it is exact.

In systems with inexact number objects of varying precisions, it may be useful to specify the precision of a constant. For this purpose, numerical constants may be written with an exponent marker that indicates the desired precision of the inexact representation. The letters s, f, d, and l specify the use of short, single, double, and long precision, respectively. (When fewer than four internal inexact representations exist, the four size specifications are mapped onto those available. For example, an implementation with two internal representations may map short and single together and long and double together.) In addition, the exponent marker e specifies the default precision for the implementation. The default precision has at least as much precision as double, but implementations may wish to allow this default to be set by the user.

3.1415926535898F0 
       Round to single, perhaps 3.141593
0.6L0
       Extend to long, perhaps .600000000000000

A number representation with nonempty mantissa width, x|p, represents the best binary floating-point approximation of x using a p-bit significand. For example, 1.1|53 is a representation of the best approximation of 1.1 in IEEE double precision. If x is an external representation of an inexact real number object that contains no vertical bar, it should be treated as if specified with a mantissa width of 53.

Implementations that use binary floating point representations of real number objects should represent x|p using a p-bit significand if practical, or by a greater precision if a p-bit significand is not practical, or by the largest available precision if p or more bits of significand are not practical within the implementation.

Note:   The precision of a significand should not be confused with the number of bits used to represent the significand. In the IEEE floating point standards, for example, the significand’s most significant bit is implicit in single and double precision but is explicit in extended precision. Whether that bit is implicit or explicit does not affect the mathematical precision. In implementations that use binary floating point, the default precision can be calculated by calling the following procedure:

(define (precision)
  (do ((n 0 (+ n 1))
       (x 1.0 (/ x 2.0)))
    ((= 1.0 (+ 1.0 x)) n)))

Note:   When the underlying floating-point representation is IEEE double precision, the |p suffix should not always be omitted: Denormalized floating-point representations have diminished precision, and therefore their external representation should carry a |p suffix with the actual width of the significand.

The literals +inf.0 and -inf.0 represent positive and negative infinity, respectively. The +nan.0 literal represents the NaN that is the result of (/ 0.0 0.0), and may represent other NaNs as well.

If x is an external representation of an inexact real number object and contains no vertical bar and no exponent marker other than e, the inexact real number object it represents is a flonum (see library section on “Flonums”). Some or all of the other external representations of inexact real number objects may also denote flonums, but that is not required by this report.

3.3  Read syntax

The read syntax describes the syntax of syntactic datain terms of a sequence of <lexeme>s, as defined in the lexical syntax.

Syntactic data include the lexeme data described in the previous section as well as the following constructs for forming compound data:

3.3.1  Formal account

The following grammar describes the syntax of syntactic data in terms of various kinds of lexemes defined in the grammar in section 3.2:

<datum> → <lexeme datum>
         | <compound datum>
<lexeme datum> → <boolean> | <number>
         | <character> | <string> | <symbol>
<symbol> → <identifier>
<compound datum> → <list> | <vector> | <bytevector>
<list> → (<datum>*) | [<datum>*]
         | (<datum>+ . <datum>) | [<datum>+ . <datum>]
         | <abbreviation>
<abbreviation> → <abbrev prefix> <datum>
<abbrev prefix> → ’ | ‘ | , | ,@ | #’ | #‘ | #, | #,@
<vector> → #(<datum>*)
<bytevector> → #vu8(<u8>*)
<u8> → ⟨any <number> representing an exact
                   integer in {0, ..., 255}⟩

3.3.2  Pairs and lists

List and pair data, denoting pairs and lists of values (see section 9.10) are written using parentheses or brackets. Matching pairs of brackets that occur in the rules of <list> are equivalent to matching pairs of parentheses.

The most general notation for Scheme pairs as syntactic data is the “dotted” notation (<datum1> . <datum2>) where <datum1> is the representation of the value of the car field and <datum2> is the representation of the value of the cdr field. For example (4 . 5) is a pair whose car is 4 and whose cdr is 5.

A more streamlined notation can be used for lists: the elements of the list are simply enclosed in parentheses and separated by spaces. The empty listis written () . For example,

(a b c d e)

and

(a . (b . (c . (d . (e . ())))))

are equivalent notations for a list of symbols.

The general rule is that, if a dot is followed by an open parenthesis, the dot, open parenthesis, and matching closing parenthesis can be omitted in the external representation.

The sequence of characters “(4 . 5)” is the external representation of a pair, not an expression that evaluates to a pair. Similarly, the sequence of characters “(+ 2 6)” is not an external representation of the integer 8, even though it is a base-library expression evaluating to the integer 8; rather, it is a syntactic datum representing a three-element list, the elements of which are the symbol + and the integers 2 and 6.

3.3.3  Vectors

Vector data, denoting vectors of values (see section 9.14), are written using the notation #(<datum> ...). For example, a vector of length 3 containing the number zero in element 0, the list (2 2 2 2) in element 1, and the string "Anna" in element 2 can be written as following:

#(0 (2 2 2 2) "Anna")

This is the external representation of a vector, not a base-library expression that evaluates to a vector.

3.3.4  Bytevectors

Bytevector data, denoting bytevectors (see library chapter on “Bytevectors”), are written using the notation #vu8(<u8> ...), where the <u8>s represent the octets of the bytevector. For example, a bytevector of length 3 containing the octets 2, 24, and 123 can be written as follows:

#vu8(2 24 123)

This is the external representation of a bytevector, and also an expression that evaluates to a bytevector.

3.3.5  Abbreviations

<datum>     
<datum>     
,<datum>     
,@<datum>     
#’<datum>     
#<datum>     
#,<datum>     
#,@<datum>     

Each of these is an abbreviation:
    <datum> for (quote <datum>),
    <datum> for (quasiquote <datum>),
    ,<datum> for (unquote <datum>),
    ,@<datum> for (unquote-splicing <datum>),
    #’<datum> for (syntax <datum>),
    #‘<datum> for (quasisyntax <datum>),
    #,<datum> for (unsyntax <datum>), and
    #,@<datum> for (unsyntax-splicing <datum>).