[r6rs-discuss] [Formal] formal comment (ports, characters, strings, Unicode)

From: Thomas Lord <lord>
Date: Thu Mar 15 18:27:58 2007

---
This message is a formal comment which was submitted to formal-comment_at_r6rs.org, following the requirements described at: http://www.r6rs.org/process.html
---
Submitter:                 Thomas Lord
Submitter email address:   lord_at_emf.net
Type of Issue:             Simplification/Enhancement/Defect
Priority:                  major
R6RS components:           base library, concepts,
                           formal syntax, I/O, Lexical Syntax,
                           Unicode
Version of the report:     5.92
Synopsis:
  Conformant implementations should not be *required* to support
  any characters beyond the portable character set of R5RS.
  The report should define a standard way to extend beyond the
  portable character set by addition of characters corresponding
  to Unicode scalar values.
  The report should recognize and honor a role for a character
  type that transcends the specifics of Unicode and encompasses
  discrete communications channels in general.  In particular,
  the report should permit the inclusion of characters which
  do not correspond to Unicode scalar values.
  The fundamental conformance requirement of an implementation
  should explicitly pertain to observable consequences of
  running a program, principally reflected as operations on ports.
Disclaimers:
  This comment is incomplete:  some changes are indicated
  but not fully spelled out;  some needed changes (under the
  premise of this comment) have no doubt been missed;
  the proposed substitute wording is, at best, a rough first
  draft;  the notion of permitting implementations to support
  less than all of Unicode has broad implications that merit
  discussion;  the implications of the proposals herein for the
  standard libraries have not been explained here.
Full Description:
  I propose a number of changes to the treatment of ports,
  characters and strings.
* Change to "Summary", page 1
  For
      "Chapter 2 explains Scheme's number types"
  Substitute
      "Chapter 2 explains several of Scheme's fundamental
      types."
* Changes to "1.1 Basic Types", page 7
  Retitle:
        1.1 Fundamental Types
  For
      Characters
      Scheme characters mostly correspond to
      textual characters. More precisely, they
      are isomorphic to the scalar values of the
      Unicode standard.
      Strings
      Strings are finite sequences of characters
      with fixed length and thus represent arbitrary
      Unicode texts.
  Substitute
      Ports
      A port is an object representing one end
      of a discrete communications channel over
      which Scheme programs can transmit and/or
      receive characters selected from a finite
      alphabet associated with the port.
      Characters
      Character objects represent characters
      such as are transmitted and received
      over a communication channel associated with
      a port.   Most commonly, character objects
      correspond to Unicode scalar values and
      are used as primitive elements when representing
      textual data.
      Strings
      A string is a linear data structure representing
      a finite sequence of arbitrary characters.
      Elements of a string are addressed by an integer
      index.   For example, a Unicode text can be
      usefully represented as a string.
* Chapter 2, "Numbers", pages 10 and 11
  Retitle the chapter:  "Fundamental Types"
  Renumber the entire current content of Chapter 2, "2.1"
   (renumbering the current "2.1" to "2.1.1", etc.)
  For
        "This chapter describes Scheme's representations
         for numbers"  (page 10)
  Substitute
        "This section describes Scheme's representations
         for numbers"  (page 10)
  Add a new introduction:
        2. Fundamental Types
        This chapter explains several of Scheme's
        fundamental types.
  Add a new section:
        2.2 Ports, Characters, and Strings
        This section describes Scheme's mechanisms
        and representation for synchronous communication
        between Scheme programs and processes which are
        external to the execution of a program.   Thus, ports,
        characters, and strings comprise an important
        part of Scheme's model for the formally observable
        side effects of running a program and the model
        for observations of external events which may
        affect a running program.
        Often but not always, such observable communication
        conveys textual information.  Thus, it is useful
        to first explain these types beginning with an
        abstract mathematical model of communication, and
        then to explain how that model applies specifically
        to textual information.
        2.2.1 Program Execution as World-line
              and Implementation Correctness
        Conceptually, for the purpose of understanding
        the observable consequences of running a program,
        the execution of a Scheme program corresponds to a
        relativistic world-line.   Information about events
        external to a running program becomes available to
        that program at a specific point on the execution's
        world-line when the program explicitly completes a
        step to receive that information.   Similarly,
        information from the running program becomes externally
        observable when explicitly transmitted at a specific
        point on the execution's world-line. 
        In portable programs, all transmissions and receipts
        of information consist of discrete atomic events
        -- the conveyance of a single character via a port --
        and these are totally ordered along the conceptual
        world-line of a program.   Each is a unique event.
        Implementations are permitted, however, to make
        extensions which allow for simultaneous
        transmissions and/or receipts.
        In an important sense, the transmission and receipt
        events that occur as a Scheme program runs are
        the *only* formally observable consequence of running
        the program.   An implementation is correct, in an
        important sense, provided only that these events
        occur as specified and in a permitted order when
        running a portable program.
        It should be noted that, while the order of
        communication events on the world-line of a
        running program is formally well-defined, that
        order is not directly observable.   That is to
        say that external observations of and transmissions
        to a Scheme program may occur, from the perspective
        of external observers, in a different order,
        and possibly with loss of information.  Only
        causality relationships, as imposed externally and
        as implied by execution-order rules in this report,
        define a partial ordering of communications events
        upon which all observers can, in principle, agree.
        [This section should cite the source of its conceptual
         model of communication, the paper:
           "The Mutual Exclusion Problem: Part I -- A Theory
            of Interprocess Communication", Leslie Lamport;
            Journal of the Association for Computing Machinery;
            Volume 33, Number 2, April 1986.
        ]
           
        2.2.2 Ports as Discrete Communication Channel Terminals
        Scheme adopts a mathematical model of communication
        based on discrete communication channels.  Each channel
        is associated with a finite, abstract alphabet.  The
        channel conveys letters from that alphabet in one or
        both directions, one at a time.  For example, the size
        of the alphabet, together with the number of letters
        that can be conveyed in a unit of time, determines the
        bandwidth of the channel.
        A port object represents a Scheme program's direct
        interface to one end of such a communication channel.
        It is through a port object that a program transmits
        and receives on the channel.   It is noteworthy that a
        port represents only one terminal point on the channel:
        the physical channel itself as well as the terminal point(s)
        of external processes are not directly accessible to
        the program.
        In this model of communication, we make no a priori
        assumptions about the alphabet whose letters are
        conveyed, other than that it is finite.   In particular,
        distinct ports may use different alphabets.
        When two ports use different alphabets, it is sometimes
        useful to treat the alphabets as disjoint sets and
        at other times useful to identify letters in one alphabet
        with letters in another.   An example of the latter
        case can be seen by comparing an ASCII-only channel to
        a Unicode scalar value channel:  it is often desirable
        to treat ASCII as a subset of Unicode.   An example
        of usefully disjoint alphabets can be seen by comparing
        a Unicode channel, used to convey textual information,
        to a channel used to control a certain style of traffic
        signal, on which a program wishes to transmit letters
        that correspond to "red", "yellow", and "green".
        It is, nevertheless, the case that many useful
        procedures reasonably operate generically on all
        letters, without regard to which alphabet they come
        from.   For example, if a procedure is intended to
        concatenate finite sequences of letters ("strings", in
        Scheme) the same implementation for that procedure
        suffices regardless of whether the sequence comprises
        text, traffic signals, or some mix of these.   For
        that reason, Scheme includes the fundamental type
        "character", which contains all letters from all
        alphabets supported by an implementation.
        [This section should cite the source of the mathematical
         model of communication to which it refers, such as:
           "The Mathematical Theory of Communication",
            Claude E. Shannon and Warren Weaver;
            University of Illinois Press; 1963
        ]
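        [Illustration only, not proposed report text.  The
         bandwidth remark at the start of 2.2.2 can be made
         concrete with a small calculation.  The sketch below is
         in Python rather than Scheme, and the function name is
         hypothetical.  It assumes a noiseless channel on which
         every letter takes the same time to send, so the figure
         is an upper bound rather than a measured rate.
        ]

```python
import math


def max_bits_per_second(alphabet_size: int, letters_per_second: float) -> float:
    """Upper bound on the information rate of a noiseless channel.

    Each letter drawn from an alphabet of `alphabet_size` letters can
    carry at most log2(alphabet_size) bits.
    """
    if alphabet_size < 1:
        raise ValueError("an alphabet must contain at least one letter")
    return letters_per_second * math.log2(alphabet_size)


# A 128-letter (ASCII-sized) alphabet at 1000 letters per second.
ascii_rate = max_bits_per_second(128, 1000)

# A three-letter traffic-signal alphabet at one letter per second.
signal_rate = max_bits_per_second(3, 1)
```

        [Under these assumptions the traffic-signal channel
         carries fewer than two bits per letter, which is why
         alphabet size and letter rate must both be known before
         anything can be said about a channel's bandwidth.
        ]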
        2.2.3 Unicode Scalar Values: A Portable, Textual Alphabet
        This report defines certain character values which must
        be supported by all implementations and others which
        may be supported by any implementation but only in
        specified ways.   Together, these comprise the Unicode
        scalar values and they are included in Scheme so that
        portable programs may reliably manipulate textual
        information in the broadest practical range of human
        languages and, more specifically, so that portable
        Scheme programs can reliably manipulate the source text
        of portable Scheme programs.
        Unicode scalar values are formally defined by an
        established but evolving standard, "The Unicode
        Standard," as published by The Unicode Consortium.
        Informally speaking, the scalar values "roughly
        correspond" to the character-like elements of
        human writing systems.   In its details, however, the
        exact relationship to writing systems is complex and
        readers are referred to The Unicode Standard for a
        complete explanation.
        2.2.4 Character Order
        Communications channel alphabets in general, and Unicode
        in particular, are frequently defined by standards
        procedures which are external to the process which
        defines Scheme.   Frequently, as with Unicode scalar
        values, a total ordering of the letters within an
        alphabet is included in the definition.
        Consequently, Scheme includes procedures which compare
        two or more characters for their ordering.   Portable
        programs may rely on Unicode scalar values being
        well-ordered and on that order corresponding to the
        definitions of The Unicode Standard.
        When characters represent letters from either an
        unordered alphabet or from disjoint alphabets, the
        ordering imposed on them may be implementation
        specific or the characters may be unordered.  Thus,
        portable programs which assume that all characters they
        encounter are well-ordered may cause errors if run
        in implementations and contexts that present these
        programs with non-portable characters.   Nevertheless,
        it is generally reasonable for portable programs that
        are concerned mainly with Unicode scalar values to
        assume that all characters they encounter will be
        well-ordered.
        2.2.5 Character Enumeration
        Similarly, external standards, The Unicode Standard
        in particular, often define a mapping from the letters
        of an abstract alphabet to (usually non-negative)
        exact integer values.
        Because of the central importance of enabling portable
        programs to reliably manipulate textual data, this
        report requires implementations to convert Unicode
        scalar values to the corresponding integer, and vice
        versa.   Implementations are permitted but not required
        to include additional characters that can be converted
        to and from integers, provided they satisfy this Unicode
        requirement.
        Implementations may include characters for which there
        is no conversion to and from integers, using the
        standard procedures defined herein.   Nevertheless,
        it is generally reasonable for portable programs that
        are concerned mainly with Unicode scalar values to
        assume that all characters they encounter will be
        convertible to and from integers.
        2.2.6 Strings and String Ordering
        Ports, by definition, convey characters, one at a time.
        It is commonly necessary, especially when textual
        information is being manipulated, to manage finite
        sequences of characters.
        Scheme's string objects represent finite sequences
        of arbitrary characters.
        When two strings are comprised entirely of well-ordered
        characters, a natural lexical ordering of the strings
        may be inferred.   In the case of characters
        corresponding to Unicode scalar values, that ordering
        is an imperfect but frequently useful approximation
        of the lexical linguistic ordering of texts.
        2.2.7 Characters, Strings, and Case Conversions
        The lexical syntax of Scheme relies upon certain very
        limited forms of case conversion among textual letters.
        These conversions are a subset of a standard,
        linguistically approximate case conversion among
        Unicode scalar values.   Scheme includes procedures
        which effect these conversions, as well as their natural
        character-wise extensions to strings.
        2.2.8 Ports, Characters, and Strings: A Summary
        Ports are communication channel end-points held by a
        running Scheme program.   Characters are letters, from
        finite abstract alphabets, conveyed over these channels.
        Strings are finite sequences of characters.
        Portable programs must restrict themselves to characters
        corresponding to Unicode scalar values.   These
        characters are well-ordered and correspond to
        standardized integer values.   A linguistically
        approximate case conversion is defined among these
        characters.
        Implementations may extend the character type (and by
        implication, the port and string types) with additional
        characters.   The full set of characters supported by an
        implementation may be well-ordered but need not be.
  [or words to similar effect]
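  [Illustration only, not proposed report text.  The model of
   section 2.2 can be sketched as a small program.  The sketch
   is in Python rather than Scheme, and every name in it (Char,
   Port, world_line, transmit) is hypothetical.  It assumes one
   shared log standing in for the world-line, and it models a
   string as a plain tuple of characters.
  ]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    """A letter tagged with the alphabet it belongs to."""
    alphabet: str
    letter: str


@dataclass
class Port:
    """One end of a discrete channel with a finite alphabet."""
    name: str
    alphabet: frozenset
    world_line: list  # shared, totally ordered log of communication events

    def transmit(self, c: Char) -> None:
        # A port conveys only letters of its own alphabet.
        if c not in self.alphabet:
            raise ValueError(f"{c} is not in the alphabet of port {self.name}")
        self.world_line.append((self.name, "transmit", c))


world_line = []

# Two disjoint alphabets: traffic signals and the ASCII subset of Unicode.
RED, YELLOW, GREEN = (Char("traffic-signal", s) for s in ("red", "yellow", "green"))
ASCII = frozenset(Char("unicode", chr(n)) for n in range(128))

signal = Port("signal", frozenset({RED, YELLOW, GREEN}), world_line)
console = Port("console", ASCII, world_line)

# Each transmission is a single atomic event; the log records their order.
signal.transmit(GREEN)
console.transmit(Char("unicode", "A"))

# Concatenation is generic: it does not care which alphabet a letter is from.
mixed = (Char("unicode", "A"), RED) + (GREEN,)
```

  [In this sketch, sending RED on the console port is an error,
   while the sequence "mixed" freely combines letters from both
   alphabets, as the discussion of the character type in 2.2.2
   intends.
  ]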
* Chapter 3, "Lexical syntax and read syntax"
  In general, implementations should not be required to support
  more than a minimal portable character set while, at the same
  time, there should be only one permitted way to add support
  for fully general Unicode scalar value characters.
  In 3.2.1 ("Formal Account" p. 12) the definition of
  <constituent> is too strong.
  For
        <any character whose Unicode scalar value....>
  Substitute
        <any character, supported by the implementation,
         whose Unicode scalar value ....>
  In 3.2.3, p.14:
  For
        Moreover, all characters whose...
  Substitute
        Moreover, all characters supported by an implementation, whose
  Similar fixes to 3.2.5, p. 14.
  In 3.2.6, p. 15, the definition of "\x" notation needs similar
  fixes.
* Chapter 4, section 4.3, "Exceptional situations", p. 18
  It is unclear whether or not it is intended to permit
  implementations to use the condition system as a means
  to asynchronously communicate information to an application.
  If so, slight changes are merited to the proposed addition of
  section 2.2 ("Ports, Characters, and Strings") above.
  [Note: it is a matter worthy of explicit debate whether or not
  the condition system should be used for asynchronous communication.]
* Chapter 9, Section 9.1, "Base Types"
  Add "port?" to the list.  
  I suggest renaming the section, "Fundamental types" because
  "base" carries too many overtones from the vocabulary of
  object oriented programming languages.
  Ports should be considered a fundamental type for reasons
  given in the proposed addition of 2.2 ("Ports, Characters, and
  Strings"), above.
* Chapter 9, Section 9.13, "Characters", p. 49
  Insert a section here introducing ports.
* Chapter 9, Section 9.13, "Characters", p. 49ff
  For
    *Characters* are objects that represent Unicode scalar
    values[46].
  Substitute
    *Characters* are objects that represent abstract
    letters from a communications channel (port) alphabet.
  For
    *Note:* Unicode defines [....] (whose code is in the
    range #x10000 to #X10FFFF).
  Substitute
    All implementations of Scheme are required to support
    the characters [as per the R5 portable character set].
    Implementations should additionally support a larger
    character set corresponding to Unicode scalar values.
  For
      [the definitions of char->integer and integer->char]
  Substitute
      (char->integer /char/)            procedure
      (integer->char /int/)             procedure
 
      For characters with an integer mapping (see section
      2.2) these procedures implement a bijective mapping
      between characters and integers.   In particular,
      characters which correspond to Unicode scalar values
      must be mapped to the corresponding exact integer.
      For other characters which an implementation may
      support, these procedures have unspecified behavior
      and return values.
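      [Illustration only, not proposed report text.  The
       substitute wording above can be modelled as follows.
       The sketch is in Python rather than Scheme, and the
       names Char, integer_to_char and char_to_integer are
       hypothetical stand-ins for the Scheme procedures.  The
       proposal leaves the behavior for characters without an
       integer mapping unspecified; raising an error is only
       this model's choice.
      ]

```python
from dataclasses import dataclass
from typing import Optional


def is_unicode_scalar_value(n: int) -> bool:
    # Scalar values are the code points 0..0x10FFFF minus the surrogates.
    return 0 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF


@dataclass(frozen=True)
class Char:
    alphabet: str
    letter: str
    code: Optional[int] = None  # None: no integer mapping for this character


def integer_to_char(n: int) -> Char:
    if not is_unicode_scalar_value(n):
        raise ValueError(f"{n:#x} is not a Unicode scalar value")
    return Char("unicode", chr(n), n)


def char_to_integer(c: Char) -> int:
    if c.code is None:
        # Unspecified by the proposal; this model chooses to signal an error.
        raise ValueError(f"{c.letter} has no integer mapping")
    return c.code


# A non-Unicode character with no integer mapping.
GREEN = Char("traffic-signal", "green")
```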
  For (p.50)
        These procedures impose a total ordering on the
        set of characters according to their Unicode
        scalar values.
  Substitute
        These procedures define a partial ordering among
        characters.   For characters with an integer
        mapping (as given by char->integer) the ordering
        among characters is the same as the ordering of
        the corresponding integers.
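        [Illustration only, not proposed report text.  The
         partial ordering described above can be modelled as
         follows.  The sketch is in Python rather than Scheme,
         and char_less is a hypothetical stand-in for char<?.
         It assumes that a character without an integer mapping
         is simply unordered, reported here as None; the
         proposal would also permit an implementation-specific
         ordering in that case.
        ]

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Char:
    alphabet: str
    letter: str
    code: Optional[int] = None  # None: no integer mapping for this character


def char_less(a: Char, b: Char) -> Optional[bool]:
    """Compare two characters under the proposed partial ordering.

    Returns True or False when both characters have an integer mapping,
    and None when the pair is unordered in this model.
    """
    if a.code is None or b.code is None:
        return None
    return a.code < b.code


A = Char("unicode", "A", 0x41)
LAMBDA = Char("unicode", "\u03bb", 0x3BB)
RED = Char("traffic-signal", "red")
```

        [A portable program that only ever sees Unicode scalar
         values never observes the unordered case, which is the
         assumption section 2.2.4 calls generally reasonable.
        ]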
Received on Thu Mar 15 2007 - 01:57:36 UTC