[r6rs-discuss] Stateful codecs and inefficient transcoding

From: Marcin 'Qrczak' Kowalczyk <qrczak>
Date: Tue Oct 31 06:06:38 2006

William D Clinger <will_at_ccs.neu.edu> writes:

> I don't think it is possible to design an i/o system that satisfies
> both of the following requirements:
>
> * arbitrary mixing of binary and textual i/o
> * efficient support for all possible transcoders

Indeed, not both at the same time. But it's possible to design an i/o
system which offers the choice. I would make efficient support for
arbitrary transcoders the default; it is faster even for the simple
codecs with which the other choice works at all.

> The draft R6RS does not require or envision any mutable state
> in transcoders, nor does it provide or require any means for
> maintaining mutable transcoder state across invocations of textual
> i/o procedures.

It's unclear whether these two snippets are equivalent for all
transcoders:

  (let ((port (open-file-input-port filename (file-options) transcoder)))
    (let loop ()
      (let ((ch (get-char port))) ; or perhaps read-char
        (unless (eof-object? ch)
          (write-char ch)
          (loop)))))

  (let ((port (open-file-input-port filename)))
    (let loop ()
      (let ((ch (get-char port transcoder)))
        (unless (eof-object? ch)
          (write-char ch)
          (loop)))))

If yes, then stateful transcoders are not supported at all.

>> It could work only if the port maintains a transcoder-dependent
>> state which is passed to transcoder invocations, but then using
>> different transcoders for different I/O operations would mix up
>> incompatible states.
>
> The ports of the draft R6RS *do* maintain a transcoder as part
> of their state, and this transcoder is used implicitly by most
> procedures in (r6rs io simple).

I know that it maintains the transcoder, but does it maintain the
current transcoder-dependent state of the ongoing conversion?
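
A minimal sketch of why this question matters, assuming a hypothetical
toy codec (not part of any R6RS draft): byte #x0E shifts into an
upper-case mode and #x0F shifts back, loosely in the spirit of the
shift sequences of ISO-2022. The names make-shift-decoder and
decode-all are invented for illustration, and the library names are
those of the final R6RS rather than the draft. Keeping one decoder
alive across calls models a port that stores the conversion state;
creating a fresh decoder per call models passing a bare transcoder to
each operation.

```scheme
(import (rnrs))

;; Hypothetical toy codec: returns a closure that owns the shift state.
;; The closure maps one byte to a character, or to #f when the byte only
;; changes the state and produces no character.
(define (make-shift-decoder)
  (let ((shifted? #f))                  ; state that must survive between calls
    (lambda (byte)
      (cond ((= byte #x0E) (set! shifted? #t) #f)
            ((= byte #x0F) (set! shifted? #f) #f)
            (shifted? (char-upcase (integer->char byte)))
            (else (integer->char byte))))))

;; Decode a bytevector one byte at a time. When fresh-each-call? is true,
;; every byte gets a brand-new decoder, so the shift state is lost.
(define (decode-all bv fresh-each-call?)
  (let ((shared (make-shift-decoder)))
    (let loop ((i 0) (acc '()))
      (if (= i (bytevector-length bv))
          (list->string (reverse acc))
          (let* ((dec (if fresh-each-call? (make-shift-decoder) shared))
                 (ch (dec (bytevector-u8-ref bv i))))
            (loop (+ i 1) (if ch (cons ch acc) acc)))))))

(define input (u8-list->bytevector (list 97 #x0E 98 99 #x0F 100)))

(display (decode-all input #f)) (newline)   ; prints aBCd (state kept)
(display (decode-all input #t)) (newline)   ; prints abcd (state lost)
```

The two results differ only because the state is dropped between
calls, which is the situation the second snippet above would be in if
the port kept no conversion state.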

> Although the draft R6RS does not have your hypothetical utf-16-codec
> that relies on an initial BOM to select the endianness, I believe
> the draft R6RS allows that hypothetical codec as an extension. That
> extension could be implemented efficiently when it is specified as
> an argument to any of the first three procedures listed above: the
> implementation could peek at the first two bytes, decide whether to
> use UTF-16BE or UTF-16LE, and could install one of those two as the
> transcoder associated with the port.

This suffices only in the special case where the statefulness
of the transcoder manifests only at the beginning: no ISO-2022,
no transparent compression, no iconv() in general unless the encoding
is known to be stateless (even iconv's UTF-16 would not work).

> As a general principle, I believe implementations that make such
> extensions become responsible for providing a reasonable semantics
> for the extensions.

It's impossible to provide a reasonable implementation of a poorly
chosen interface.

> more importantly, I have investigated the efficiency of the proposed
> interface, and I know it can be implemented efficiently for the
> specific transcoders envisioned by the draft R6RS.

And only for them. This interface doesn't scale to other transcoders.
Please consider how to implement an iconv transcoder.

I don't buy a design where using iconv instead of one of a few builtin
codecs requires using a completely different interface.

Worse: an interface where iconv could be used cannot be constructed
from pieces provided by R6RS except the lowest level. Everything above
readers and writers, i.e. ports, would have to be reimplemented, or
would have to use implementation-specific hooks not specified by R6RS,
in order to support iconv.

> * omitting or specifying the optional transcoder is
> more efficient, in the common case where a compiler
> can determine which transcoder is to be used, than
> having each operation fetch the transcoder out of
> a port object

It's less efficient, not more efficient, because buffering cannot be
applied above the transcoder. In my design buffering is typically on
top, so it amortizes calls to the transcoder: the transcoder is invoked
once per block rather than once per character, and where it is fetched
from no longer matters.
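
The amortization argument can be sketched as follows. This is only an
illustration under stated assumptions: decode-block is a hypothetical
Latin-1 block decoder that counts its own invocations, and
make-buffered-reader is an invented character-buffering layer placed
above it; neither is an interface from the draft.

```scheme
(import (rnrs))

;; Hypothetical block decoder: Latin-1 bytes in [start, end) -> string.
;; It counts how many times it has been invoked.
(define decoder-calls 0)
(define (decode-block bv start end)
  (set! decoder-calls (+ decoder-calls 1))
  (let loop ((i (- end 1)) (chars '()))
    (if (< i start)
        (list->string chars)
        (loop (- i 1)
              (cons (integer->char (bytevector-u8-ref bv i)) chars)))))

;; Character buffer layered above the decoder. It refills itself one
;; block at a time and hands out single characters from that block.
(define (make-buffered-reader bv block-size)
  (let ((pos 0) (buf "") (buf-pos 0))
    (lambda ()
      (when (and (= buf-pos (string-length buf))
                 (< pos (bytevector-length bv)))
        (let ((end (min (bytevector-length bv) (+ pos block-size))))
          (set! buf (decode-block bv pos end))
          (set! buf-pos 0)
          (set! pos end)))
      (if (= buf-pos (string-length buf))
          (eof-object)
          (let ((ch (string-ref buf buf-pos)))
            (set! buf-pos (+ buf-pos 1))
            ch)))))

;; Read 1000 characters one at a time and report the decoder call count.
(define (count-calls block-size)
  (set! decoder-calls 0)
  (let ((next (make-buffered-reader (make-bytevector 1000 65) block-size)))
    (let loop ()
      (unless (eof-object? (next))
        (loop)))
    decoder-calls))

(display (count-calls 1))   (newline)   ; prints 1000
(display (count-calls 256)) (newline)   ; prints 4
```

The caller still reads one character at a time in both runs; only the
block size of the layer above the decoder changes the number of
decoder invocations.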

>> Binary streams and text streams are distinguished.
>
> That is a common design choice, but it is a limiting choice. There
> are several important file formats, e.g. MPEG, that contain both
> binary and textual data.

It contains binary data, and some fragments of it can be interpreted
as text.

> Furthermore I am told that some important file formats, e.g. XML,
> use several different textual encodings.

It uses a single encoding, specified at the top of the file.
In my design it can be read thus:

1. A byte buffering layer is put on the input.
2. It is peeked to check for the BOM and the encoding.
3. A text decoder with the recognized encoding (together with
   character buffering) is put on the byte buffering layer.
   XML parsing proceeds from there.

For HTML, where the encoding can be specified with <meta http-equiv=...>
and finding it thus requires non-trivial HTML parsing, the procedure can
be modified as follows:

2.1. A copying input stream (which translates imperative i/o to
     lookahead requests of the underlying stream), ISO-8859-1 decoder,
     and character buffer are put on the byte buffering layer.
2.2. HTML is parsed from there, and headers are scanned to look for
     the encoding.
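
The BOM check of step 2 and the decoder choice of step 3 in the XML
case can be sketched as below. The sketch makes several simplifying
assumptions: it works on a whole bytevector instead of a peekable byte
buffering layer, it looks only at the BOM and not at the encoding
declaration, it falls back to UTF-8, and the names sniff-bom and
decode-after-bom are invented for illustration. The conversion
procedures are those of the final R6RS (rnrs bytevectors) library.

```scheme
(import (rnrs))

;; Return two values: a symbol naming the encoding guessed from the BOM,
;; and the number of BOM bytes to skip. Without a BOM, assume UTF-8.
(define (sniff-bom bv)
  (define (prefix? . bytes)
    (and (>= (bytevector-length bv) (length bytes))
         (let loop ((i 0) (bs bytes))
           (or (null? bs)
               (and (= (bytevector-u8-ref bv i) (car bs))
                    (loop (+ i 1) (cdr bs)))))))
  (cond ((prefix? #xEF #xBB #xBF) (values 'utf-8 3))
        ((prefix? #xFE #xFF)      (values 'utf-16be 2))
        ((prefix? #xFF #xFE)      (values 'utf-16le 2))
        (else                     (values 'utf-8 0))))

;; Strip the BOM, then decode the rest with the recognized encoding.
(define (decode-after-bom bv)
  (let-values (((enc skip) (sniff-bom bv)))
    (let* ((n (- (bytevector-length bv) skip))
           (body (make-bytevector n)))
      (bytevector-copy! bv skip body 0 n)
      (case enc
        ((utf-8)    (utf8->string body))
        ;; #t makes the given endianness mandatory: no second BOM check.
        ((utf-16be) (utf16->string body (endianness big) #t))
        ((utf-16le) (utf16->string body (endianness little) #t))))))

(display (decode-after-bom
          (u8-list->bytevector '(#xFF #xFE #x68 #x00 #x69 #x00))))
(newline)                                ; prints hi
```

The encoding is decided once, before any text decoder is installed, so
the decoder itself never has to be switched in mid-stream.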

It's also possible to mimic the R6RS design, which is somewhat easier
but inefficient:

1. A byte buffering layer with a single-byte buffer is put on the
   input, to ensure that decoding consumes as few bytes as possible
   for each character.
2. A UTF-8 decoder is put above it.
3. A switching input stream (which can have the underlying stream
   switched at any time) is put above it.
4. As soon as the element specifying the encoding is detected, the
   switching layer is switched to a properly decoded stream based
   on the same input.
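
A rough sketch of the switching layer, under loudly stated
assumptions: a "stream" is modelled as a thunk returning the next item
or the eof object, the first decoder is a hypothetical ASCII-only
stand-in for the UTF-8 decoder of step 2, and all names
(make-byte-source, make-decoder, make-switching-stream, demo) are
invented for illustration. Because each decoder pulls exactly one byte
per character, the shared byte source is at the right position when
the switch happens.

```scheme
(import (rnrs))

;; Byte source shared by all decoders; hands out one byte per call.
(define (make-byte-source bv)
  (let ((i 0))
    (lambda ()
      (if (= i (bytevector-length bv))
          (eof-object)
          (let ((b (bytevector-u8-ref bv i)))
            (set! i (+ i 1))
            b)))))

;; Decoder layer: consumes exactly one byte for each character.
(define (make-decoder source byte->char)
  (lambda ()
    (let ((b (source)))
      (if (eof-object? b) b (byte->char b)))))

;; Hypothetical single-byte decoders used only for this demonstration.
(define (ascii-byte->char b) (if (< b 128) (integer->char b) #\xFFFD))
(define (latin-1-byte->char b) (integer->char b))

;; Switching layer: returns a reader and a procedure that replaces the
;; underlying stream at any time.
(define (make-switching-stream initial)
  (let ((current initial))
    (values (lambda () (current))
            (lambda (new) (set! current new)))))

;; Read two characters, optionally switching decoders after the first.
(define (demo switch?)
  (let ((source (make-byte-source (u8-list->bytevector '(97 233)))))
    (let-values (((next switch!)
                  (make-switching-stream
                   (make-decoder source ascii-byte->char))))
      (let* ((c1 (next))
             (ignored (when switch?
                        (switch! (make-decoder source latin-1-byte->char))))
             (c2 (next)))
        (list c1 c2)))))

(write (map char->integer (demo #t))) (newline)   ; prints (97 233)
(write (map char->integer (demo #f))) (newline)   ; prints (97 65533)
```

With a larger byte buffer below the first decoder, bytes already
pulled into that buffer would be lost to the second decoder, which is
why step 1 insists on a single-byte buffer.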

>> Most transformations are designed to be performed a block at a time,
>> moving bytes or characters between buffers. Most kinds of layers
>> support only block I/O.
>
> In my view, the draft R6RS supports this model, though not as
> conveniently as one might like.

I'm afraid it does not. Please try to implement an iconv transcoder.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak_at_knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
Received on Tue Oct 31 2006 - 06:06:28 UTC
