[r6rs-discuss] Stateful codecs and inefficient transcoding from William D Clinger on 2006-11-07 (r6rs-discuss.mbox)

From: William D Clinger <will>
Date: Tue Nov 7 11:33:46 2006

I am posting this as an individual member of the Scheme
community. I am not speaking for the R6RS editors.

For the latest, up-to-the-minute list of my mistakes and
other people's ideas for fixing draft R6RS section 15.3
(port i/o), please see http://www.ccs.neu.edu/home/will/R6RS/

This message offers rationale for the most recent changes,
most of which were requested by the persons quoted below.
Those changes are summarized by the following paragraphs:

    The main ideas are to distinguish binary from text files,
    and to forbid compositions of transcoders. Composition
    of transcoders is well-defined in a mathematical sense,
    but the composition of two transcoders is unlikely to be
    useful.

    Those ideas run counter to the ideas of SRFI 81, which
    was a starting point for section 15.3 of the draft report.

                                * * *

John Cowan wrote:
> > * The binary transcoder is defined as the transcoder
> > constructed from the latin-1-codec, lf eof-style,
> > and (arbitrarily, since no transcoding errors are
> > possible) the raise handling mode.
>
> No, no, a thousand times no!

Amen! That item has been changed to:

* The binary transcoder is a special pseudo-transcoder
    that is returned by the binary-transcoder procedure
    (which would be added to the procedures described in
    section 15.3)....

The following revisions are also relevant:

* A binary port is a port whose transcoder is the
    binary transcoder.

* Binary ports are created by passing the binary
    transcoder to an open-X procedure, or by calling an
    open-bytes-X or call-with-bytes-X procedure with no
    transcoder argument.

* The binary lookahead-X, get-X, and put-X operations
    (which have "byte" or "bytes" in their names) operate
    only on binary ports.

* A text port is a port whose transcoder is not the
    binary transcoder.

* Text ports are created by passing a transcoder other
    than the binary transcoder to an open-X procedure, or
    by calling an open-X procedure without a transcoder
    argument (provided the open-X procedure is not one
    of those whose standard name contains "bytes").

* The textual lookahead-X, get-X, and put-X operations
    operate only on text ports....
    They do not accept a transcoder as an argument.
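
The binary/text split listed above can be sketched as follows.
This is only an illustration, and it assumes the procedure names
of the later, ratified R6RS (open-bytevector-input-port,
binary-port?, textual-port?) rather than the open-bytes-X names
used in these notes. In that version a port opened without a
transcoder is binary, and there is no binary-transcoder procedure.

```scheme
(import (rnrs))

;; Three bytes: the UTF-8 encoding of #\a, then the two-byte
;; UTF-8 encoding of U+00E9.
(define data (u8-list->bytevector '(#x61 #xC3 #xA9)))

;; No transcoder argument: a binary port, read with byte operations.
(define bin (open-bytevector-input-port data))

;; A transcoder argument: a text port, read with character operations.
(define txt
  (open-bytevector-input-port
   data
   (make-transcoder (utf-8-codec)
                    (eol-style lf)
                    (error-handling-mode replace))))

(define first-byte (get-u8 bin))      ; the byte #x61
(define first-char (get-char txt))    ; the character #\a
(define second-char (get-char txt))   ; U+00E9, decoded from two bytes
```

The byte operations apply only to bin and the character
operations only to txt, which mirrors the separation proposed
in the list above.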

Andrew Pochinsky wrote:
> Shouldn't it be a bit different:
> for a composition of t1 and t2 the input transcoder is t1input
> followed by t1output followed by t2input, and the output transcoder
> is t2output followed by t1input followed by t1output ?
> If defined this way, then for the case when t1 and t2 are invertible
> (input followed by output is an identity and vice versa), so is
> their composition.

That makes at least as much sense as my definition. The
problem is that no one seems to know of any use cases in
which compositions of transcoders are useful. Note that
the semantics I described for transcoded-port (using my
third, corrected definition of composition) does not work
for the XML use case. I can't think of any cases for
which that semantics is likely to be useful, for input or
output, with my definition of composition or with yours.

As John Cowan put it:

> I'm still waiting for a use-case involving stacked transcoders where
> neither one is the identity transcoder.

Marcin Kowalczyk wrote:
> In my design the one-direction transcoder is the more fundamental
> concept. Character encodings have names for convenience; there is a
> mapping between encoding names and encoders, and between encoding
> names and decoders.

You are right, but my notes still use two-way transcoders
as in the draft R6RS, mainly because it seems a little
simpler and I didn't want to fiddle with too many details
until the main ideas are repaired.

In any case, it is common practice to use a single name
for a pair, and that convention should be supported even
if there were a standard way to express a unidirectional
mapping as a Scheme value. (The current draft R6RS does
not forbid that; it just doesn't provide a way to do it.)

Marcin Kowalczyk wrote:
> > * To prevent interference between operations on the
> > original port and operations on the port created by
> > transcoded-port, the original port is closed when
> > the derived port is created.
>
> Such interference has a purpose: this is a way of mixing text and binary
> i/o on the same stream, or for using multiple encodings (other than
> extracting byte arrays and transcoding them separately).
>
> Unfortunately this interference is delicate:
>
> For output the transcoded stream must be flushed but not closed;
> probably flushed in the sense of notifying the transcoder about end of
> data (there are other modes of flushing when we consider compression).
>
> For input, if the length of the portion to be decoded is not known
> beforehand, but is implied by the result of the decoding, then
> decoding must be performed one character at a time. This is slow but
> unavoidable. Buffering of transcoding must be somehow turned off.

For output, my revised notes solve the problem by allowing
the transition only from binary output to textual output.
The transcoder associated with a text port cannot be changed.

If you want the effect of changing the transcoder associated
with a text port, whether for input or for output, you would
have to use binary i/o, performing the transcoding in Scheme.
The proposed procedures are adequate for doing this with
reasonable efficiency, assuming the transcoders you want to
use are among those provided by the implementation.
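
Performing the transcoding in Scheme over binary i/o, as described
above, might look like the following sketch. It assumes the names
of the ratified R6RS (get-bytevector-n, bytevector->string) and a
hypothetical framing in which each text segment is preceded by a
one-byte length. Neither that framing nor the helper name
get-text-segment comes from these notes.

```scheme
(import (rnrs))

(define latin-1 (make-transcoder (latin-1-codec) (eol-style none)))
(define utf-8   (make-transcoder (utf-8-codec) (eol-style none)))

;; Hypothetical helper: read a one-byte length, then that many bytes,
;; and decode them with the given transcoder.  Returns the eof object
;; at end of input.  A truncated segment is decoded as far as it goes.
(define (get-text-segment port transcoder)
  (let ((n (get-u8 port)))
    (cond ((eof-object? n) n)
          ((= n 0) "")
          (else
           (let ((bytes (get-bytevector-n port n)))
             (if (eof-object? bytes)
                 bytes
                 (bytevector->string bytes transcoder)))))))

;; Two segments holding the same character, U+00E9, in two encodings:
;; one byte in Latin-1, two bytes in UTF-8.
(define port
  (open-bytevector-input-port
   (u8-list->bytevector '(1 #xE9 2 #xC3 #xA9))))

(define seg1 (get-text-segment port latin-1))
(define seg2 (get-text-segment port utf-8))
```

Both seg1 and seg2 are one-character strings containing U+00E9.
The port itself stays binary throughout, so the encoding can
change from one segment to the next.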

Marcin Kowalczyk concluded:
> I don't know of any good way of avoiding these complexities. Trying to
> avoid them would either limit expressiveness or make transcoding slow
> in the usual case.

The restrictions I have placed on transcoded-port should make
transcoding fast in the usual case. They would not actually
reduce expressiveness for unusual cases, but they would make
unusual cases more awkward; see the (untested) mess of code
toward the end of my current notes.

To prevent programmers from rewriting that mess hundreds of
times, I think the R6RS should provide something like
get-char-from-binary and lookahead-char-from-binary.
Using hypothetical versions of those procedures, the cruft
in my notes would simplify to the following cruft:

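    ;; Open the file as a binary port, so that nothing is decoded yet.
    (let* ((p (open-file-input-port "foo.xml"
                                    (file-options)
                                    (binary-transcoder)))
           ;; Provisional transcoder, used only to read the XML prefix.
           (utf8-transcoder
            (make-transcoder (utf-8-codec) 'lf 'replace))
           (local-peek-char
            (lambda () (lookahead-char-from-binary p utf8-transcoder)))
           (local-read-char
            (lambda () (get-char-from-binary p utf8-transcoder)))
           ;; The prefix parser returns the transcoder for the rest.
           (transcoder
            (parse-xml-prefix local-peek-char local-read-char))
           ;; Switch the remainder of the stream from binary to text.
           (input (transcoded-port p transcoder)))
      (parse-rest-of-xml input))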

Alternatively, the lookahead-char and get-char procedures
could accept an optional transcoder argument; when that
optional argument is supplied, the first argument would
have to be a binary input port.
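
For concreteness, here is one way the hypothetical
get-char-from-binary might behave when the transcoder is fixed
to UTF-8. This is a sketch under stated assumptions: it uses the
names of the ratified R6RS (get-u8, utf8->string, transcoded-port),
takes no transcoder argument, trusts the input to be well-formed
and untruncated UTF-8, and the names utf8-sequence-length and
get-utf8-char-from-binary are invented for illustration. The point
is that it consumes exactly the bytes of one character, so the
port can afterwards be handed to transcoded-port.

```scheme
(import (rnrs))

;; Length in bytes of the UTF-8 sequence introduced by lead byte b.
;; Bytes that cannot start a sequence are treated as length 1.
(define (utf8-sequence-length b)
  (cond ((< b #x80) 1)
        ((= (bitwise-and b #xE0) #xC0) 2)
        ((= (bitwise-and b #xF0) #xE0) 3)
        ((= (bitwise-and b #xF8) #xF0) 4)
        (else 1)))

;; Read one character from a binary input port, consuming no byte
;; beyond the end of that character.  Returns the eof object at end
;; of input.  Assumes well-formed, untruncated UTF-8.
(define (get-utf8-char-from-binary port)
  (let ((b (get-u8 port)))
    (if (eof-object? b)
        b
        (let* ((n (utf8-sequence-length b))
               (bv (make-bytevector n b)))
          ;; Fill in the continuation bytes, if any.
          (do ((i 1 (+ i 1)))
              ((= i n))
            (bytevector-u8-set! bv i (get-u8 port)))
          (string-ref (utf8->string bv) 0)))))

;; #\a and U+03BB in UTF-8, followed by the Latin-1 byte for U+00E9.
(define p
  (open-bytevector-input-port
   (u8-list->bytevector '(#x61 #xCE #xBB #xE9))))

(define c1 (get-utf8-char-from-binary p))   ; #\a
(define c2 (get-utf8-char-from-binary p))   ; U+03BB, from two bytes

;; The last byte has not been touched, so the rest of the stream
;; can be decoded with a different transcoder.
(define rest-port
  (transcoded-port p (make-transcoder (latin-1-codec) (eol-style none))))
(define c3 (get-char rest-port))            ; U+00E9
```

Because nothing is buffered ahead of the current character, the
switch to a text port loses no input. The price is one small
decoding step per character, the slowness that Marcin Kowalczyk
describes above.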

John Cowan wrote:
> That reminds me: there should be a procedure to apply a transcoder to a
> string and produce a byte object (virtual output) and another to apply
> a transcoder to a byte object and produce a string (virtual input).

Agreed. I think those procedures should be trivial to
define using (appropriate revisions of) the procedures
for bytes and string ports, e.g.

    (define (transduce-bytes bytes transcoder)
      (let ((out (open-bytes-output-port)))
        (put-bytes out bytes)
        (get-output-string out transcoder)))

The general silliness and potential inefficiency of that
code argue for making transduce-bytes and transduce-string
into standard procedures.
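
A sketch of the pair, written against the port procedures of the
ratified R6RS rather than the draft names above. That choice is
an assumption: in the ratified report open-bytevector-input-port
and open-bytevector-output-port accept an optional transcoder.
The definitions are illustrative only, not proposed standard text.

```scheme
(import (rnrs))

;; Bytes -> string: read everything through a transcoded input port.
(define (transduce-bytes bytes transcoder)
  (let* ((in (open-bytevector-input-port bytes transcoder))
         (s (get-string-all in)))
    (if (eof-object? s) "" s)))   ; empty input yields the eof object

;; String -> bytes: write through a transcoded output port, then
;; extract the accumulated bytes.
(define (transduce-string str transcoder)
  (let-values (((out extract) (open-bytevector-output-port transcoder)))
    (put-string out str)
    (flush-output-port out)
    (extract)))

(define utf-8 (make-transcoder (utf-8-codec) (eol-style none)))
(define encoded (transduce-string "abc" utf-8))   ; bytes #x61 #x62 #x63
(define decoded (transduce-bytes encoded utf-8))  ; "abc"
```

The ratified report provides this pair directly as
bytevector->string and string->bytevector.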

Will
Received on Tue Nov 07 2006 - 11:33:39 UTC
