[r6rs-discuss] [Formal] Improve port i/o.

From: William D Clinger <will>
Date: Mon Nov 13 21:19:35 2006

---
This message is a formal comment which was submitted to formal-comment_at_r6rs.org, following the requirements described at: http://www.r6rs.org/process.html
---
Submitter: William D Clinger
Email address: will_at_ccs.neu.edu
Issue type: Defect
Priority: Major
Component: I/O
Report version: 5.91
Summary: Improve port i/o.
Full description of issue:
Section 15.3 of the draft R6RS describes a design for
port i/o that was based on my misunderstanding of the
requirements.  In particular, it was designed to allow
arbitrary mixing of binary and textual i/o for a small
set of Unicode character encodings, but does not
generalize well to the large set of encodings that are
currently in use.
The real requirements appear to be:
 *  Support efficient binary i/o.
 *  Support efficient text i/o.
 *  Provide a small set of standard transcoders, while
    allowing implementations to provide others, including
    transcoders with arbitrarily weird semantics.
 *  Support conversion of binary ports into text ports,
    mainly to support use cases such as input from XML
    files, where the transcoder is determined by reading
    a small prefix of the file.
The first three of those requirements can be satisfied
by a more conventional design.
The fourth requirement can be met by a procedure that
accepts a binary port as argument and returns a text
port that consumes bytes from the binary port while
transcoding them into characters.
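
The XML use case behind the fourth requirement can be modeled outside Scheme. The following is a minimal sketch in Python, chosen only for illustration; it is not the proposed R6RS interface. The helper name open_xml_text, the 128-byte prefix length, and the UTF-8 fallback are assumptions of this sketch, and the prefix scan assumes an ASCII-compatible encoding.

```python
import io
import re


# Hypothetical helper: sniff the XML declaration on the binary port,
# then hand the port to a text layer that decodes with that encoding.
def open_xml_text(binary_port, prefix_len=128):
    prefix = binary_port.read(prefix_len)   # binary i/o on the raw port
    binary_port.seek(0)                     # rewind so no bytes are lost
    match = re.search(rb'encoding=["\']([A-Za-z0-9._-]+)["\']', prefix)
    encoding = match.group(1).decode("ascii") if match else "utf-8"
    # From here on, only the returned text port should be used.
    return io.TextIOWrapper(binary_port, encoding=encoding)


data = '<?xml version="1.0" encoding="ISO-8859-1"?><a>\u00e9</a>'.encode("latin-1")
text_port = open_xml_text(io.BytesIO(data))
document = text_port.read()
```

In this model the text port consumes bytes from the binary port and transcodes them, which is the behavior the paragraph above asks for.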
The rest of this comment suggests a better design, and
then describes some outstanding issues for which I have
no strong recommendation at this time.
                                * * *
The main ideas of this alternative design are to distinguish
binary from text files, and to forbid compositions of
transcoders.  Composition of transcoders is well-defined
in a mathematical sense, but the composition of two
transcoders is unlikely to be useful.
Those ideas run counter to the ideas of SRFI 81, which
was a starting point for section 15.3 of the draft report.
Other aspects of the suggested design include:
 *  A transcoder is an immutable description (think of
    it as a factory method for manufacturing transcoding
    objects) of some possibly stateful algorithm for
    translating sequences of bytes into sequences of
    characters and vice versa.
 *  Every transcoder can operate in the input direction
    (bytes to characters) or in the output direction
    (characters to bytes), but the composition of those
    directions need not be identity (and often isn't).
    (See [issue:bidirectional].)
 *  Transcoders are never composed, so there is no reason
    to define the composition of two transcoders.
 *  The standard transcoders are constructed from codecs,
    eol styles, and handling modes as described in section
    15.3 of the draft R6RS.
 *  The standard codecs of Scheme include:
        latin-1-codec
        utf-8-codec
        utf-16-codec
        utf-32-codec
 *  That list of standard codecs includes three of the
    seven Unicode character encoding schemes, but omits
    UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE on the
    grounds that Scheme programmers should be encouraged
    to use codecs that use and interpret a byte-order-mark
    (BOM) or its absence as specified by the Unicode
    standard.  (See [issue:BOM].)
 *  Implementations may support other codecs, eol styles,
    and other kinds of transcoders.  In particular, they
    may support Unicode character encoding schemes that
    interpret a BOM as a ZERO WIDTH NO-BREAK SPACE, a
    noncharacter, or as a private use character.
 *  The binary transcoder is a special pseudo-transcoder
    that is returned by the binary-transcoder procedure
    (which would be added to the procedures described in
    section 15.3).  Every binary transcoder is eqv? to
    every binary transcoder (but not necessarily eq?),
    and is not eqv? to any transcoder that is returned
    by the make-transcoder procedure.  The transcoder-codec,
    transcoder-eol-style, and transcoder-error-handling-mode
    procedures return #f when given a binary transcoder as
    their argument.
 *  A binary port is a port whose transcoder is the
    binary transcoder.
 *  Binary ports are created by passing the binary
    transcoder to an open-X procedure, or by calling an
    open-bytes-X or call-with-bytes-X procedure with no
    transcoder argument.
 *  The binary lookahead-X, get-X, and put-X operations
    (which have "byte" or "bytes" in their names) operate
    only on binary ports.
 *  A text port is a port whose transcoder is not the
    binary transcoder.
 *  Text ports are created by passing a transcoder other
    than the binary transcoder to an open-X procedure, or
    by calling an open-X procedure without a transcoder
    argument (provided the open-X procedure is not one
    of those whose standard name contains "bytes").
 *  The textual lookahead-X, get-X, and put-X operations
    operate only on text ports.  They do not accept a
    transcoder as an argument.
 *  A new procedure, transcoded-port, takes a binary port
    and a transcoder as arguments and returns a new text
    port whose state is largely that of the binary port
    but whose transcoder is the newly specified transcoder.
 *  To prevent interference between operations on the
    original binary port and buffering of transcoded
    characters on the text port created by transcoded-port,
    the original binary port is closed when the derived text
    port is created.
    (Implementation note: the original binary port can be
    cloned, the cloned port encapsulated within the derived
    text port, and then the original port closed in a
    special way that doesn't release resources needed by
    its clone.)
 *  If no optional transcoder argument is passed to an
    open-file-X procedure, then a text port is returned
    but the transcoder associated with that text port is
    not otherwise specified.  (See [issue:locale].)
 *  The port-position and set-port-position! procedures
    are required only for binary ports that were created
    by an open-X procedure.  (See [issue:position].)
 *  The open-X procedures may raise an exception if
    the specified transcoder is not supported for the
    kind of port being opened.
 *  To simplify the process of reading individual characters
    from a binary port, the R6RS should provide something like
    get-char-from-binary and lookahead-char-from-binary,
    which would take a binary port and a transcoder as
    arguments.  (See [issue:lookahead].)
 *  The various procedures that are associated with bytes
    and string ports would also change.  The changes for
    those procedures are contingent upon acceptance of the
    design sketched above, so I will not try to suggest
    any detailed specification for those procedures in this
    comment, except to note that transcode-bytes and
    transcode-string procedures should be provided to
    simplify translations from bytes to strings and vice
    versa.
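
Several of the points in the list above (immutable transcoders acting as factories, the binary pseudo-transcoder, the binary/text split, and transcoded-port closing its argument) can be modeled in a few lines. This is a rough sketch in Python rather than Scheme, under stated assumptions: the class and function names (Transcoder, Port, binary_transcoder, transcoded_port, get_byte, get_char) are hypothetical stand-ins for the procedures discussed, and eol styles are left out.

```python
import codecs
import io
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transcoder:
    """Immutable description; codec=None models the binary transcoder."""
    codec: Optional[str] = None
    errors: str = "strict"

    def new_decoder(self):
        # Factory: each port gets its own, possibly stateful, decoder.
        return codecs.getincrementaldecoder(self.codec)(self.errors)


def binary_transcoder():
    # A fresh object per call: equal by value (like eqv?), not by identity.
    return Transcoder()


class Port:
    def __init__(self, source, transcoder=None):
        self._source = source
        self.transcoder = binary_transcoder() if transcoder is None else transcoder
        self.closed = False
        self._pending = ""
        self._decoder = None if self.is_binary() else self.transcoder.new_decoder()

    def is_binary(self):
        return self.transcoder == binary_transcoder()

    def _check(self, want_binary):
        if self.closed:
            raise ValueError("port is closed")
        if self.is_binary() != want_binary:
            raise TypeError("wrong kind of port for this operation")

    def get_byte(self):
        # Binary operation: accepted only on binary ports.
        self._check(want_binary=True)
        chunk = self._source.read(1)
        return chunk[0] if chunk else None

    def get_char(self):
        # Textual operation: accepted only on text ports.
        self._check(want_binary=False)
        while not self._pending:
            chunk = self._source.read(1)
            self._pending = self._decoder.decode(chunk, final=not chunk)
            if not chunk:
                break
        if not self._pending:
            return None
        char, self._pending = self._pending[0], self._pending[1:]
        return char


def transcoded_port(port, transcoder):
    port._check(want_binary=True)
    if transcoder == binary_transcoder():
        raise TypeError("transcoded_port needs a non-binary transcoder")
    text_port = Port(port._source, transcoder)
    port.closed = True  # the original binary port is closed
    return text_port


binary_port = Port(io.BytesIO("h\u00e9".encode("utf-8")))
first_byte = binary_port.get_byte()
text_port = transcoded_port(binary_port, Transcoder("utf-8"))
first_char = text_port.get_char()
```

Because the derived text port shares the underlying byte source, closing the original port is what prevents binary reads from interfering with the text port's decoding state.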
                                * * *
Issues:
[issue:bidirectional]
    Transcoding algorithms are unidirectional (bytes to
    characters or characters to bytes), but are usually
    named in pairs that are near-inverses of each other.
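
As an illustration of paired directions that are only near-inverses, here is a small sketch using Python's built-in codecs (not the Scheme transcoders under discussion). The helper names bytes_round_trip and text_round_trip are made up for this example.

```python
def bytes_round_trip(data, codec, errors="strict"):
    # Input direction (bytes to characters), then output direction.
    return data.decode(codec, errors).encode(codec, errors)


def text_round_trip(text, codec, errors="strict"):
    # Output direction (characters to bytes), then input direction.
    return text.encode(codec, errors).decode(codec, errors)


# Python's "utf-8-sig" accepts input without a BOM but always writes one.
remarked = bytes_round_trip(b"hi", "utf-8-sig")

# With a replacing error mode, unencodable characters are lost.
lossy = text_round_trip("caf\u00e9", "ascii", errors="replace")
```

Neither round trip returns its input, even though each pair of directions carries a single codec name.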
[issue:BOM]
    While I'm all for encouraging programmers to use the
    Unicode character encodings that interpret byte order
    marks as specified by the Unicode standard, I worry
    about documents that implicitly use or explicitly
    specify UTF-16LE or UTF-32LE, which cannot be read
    using the UTF-16 or UTF-32 codecs.  If few documents
    actually use UTF-16LE or UTF-32LE, then this is not
    much of a concern.
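
The concern can be made concrete with a short sketch in Python. The function decode_utf16_scheme is a hypothetical stand-in for a utf-16-codec that follows the Unicode rule for the UTF-16 encoding scheme: honor a leading BOM, and otherwise assume big-endian. It relies only on Python's "utf-16-be" and "utf-16-le" codecs.

```python
def decode_utf16_scheme(data):
    # BOM present: it selects the byte order and is not part of the text.
    if data[:2] == b"\xfe\xff":
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        return data[2:].decode("utf-16-le")
    # BOM absent: the UTF-16 scheme defaults to big-endian.
    return data.decode("utf-16-be")


le_document = "hi".encode("utf-16-le")          # UTF-16LE carries no BOM
misread = decode_utf16_scheme(le_document)      # bytes taken as big-endian
marked = decode_utf16_scheme(b"\xff\xfe" + le_document)

# A UTF-16LE decoder keeps a leading U+FEFF as an ordinary character.
kept_bom = (b"\xff\xfe" + le_document).decode("utf-16-le")
```

A document that is really UTF-16LE therefore decodes to the wrong characters under the BOM-interpreting scheme, while the same bytes with a BOM decode correctly.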
[issue:locale]
    Implementations of Scheme will be in a much better
    position than the R6RS to guess the transcoding that
    is appropriate for a text file, so the R6RS should
    not insist upon any particular transcoding when none
    is specified by the call to an open-X procedure.
[issue:position]
    Asking for the byte position of a complexly transcoded
    port can be like asking for the carrier frequency of a
    spread spectrum signal, and I am told that some standard
    encodings do not always align the encodings of characters
    upon byte boundaries, so the port-position operation
    should be required only for binary ports, if at all.
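
A small experiment with Python's io module (used here only as an analogy, not as the R6RS design) shows why a byte position is hard to define for a transcoded port: the text layer may pull more bytes from the underlying binary stream than correspond to the characters the program has actually read. The amount of read-ahead is an implementation detail, so no specific offset is assumed below.

```python
import io

data = "a\u00e9\U0001F600".encode("utf-8")   # 1 + 2 + 4 bytes
raw = io.BytesIO(data)
text = io.TextIOWrapper(raw, encoding="utf-8")

first = text.read(1)        # the program has consumed one character
bytes_pulled = raw.tell()   # bytes the text layer has taken so far
consumed = len(first.encode("utf-8"))
```

Here bytes_pulled and consumed differ whenever the text layer buffers ahead, so the position of the underlying binary stream says little about where the text port is.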
[issue:lookahead]
    For a more general approach to this problem, see
    http://lists.r6rs.org/pipermail/r6rs-discuss/2006-November/000646.html
    (See also [issue:readers].)
[issue:readers]
    The readers described in section 15.2 of the draft R6RS
    might seem relevant to the problem of providing ports
    with arbitrary lookahead, but they can't solve that
    problem because they aren't ports.  It seems as though
    the right thing to do may be to eliminate readers and
    writers from the report, while folding their functions
    into ports that represent arbitrary sources and sinks.
    That might be too radical for R6RS, but dropping readers
    and writers from the R6RS would clear the way for a more
    general solution in R7RS.
[end of comment]
Received on Mon Nov 13 2006 - 10:36:49 UTC
