[r6rs-discuss] Stateful codecs and inefficient transcoding from Marcin 'Qrczak' Kowalczyk on 2006-10-30 (r6rs-discuss.mbox)

From: Marcin 'Qrczak' Kowalczyk <qrczak>
Date: Mon Oct 30 18:40:05 2006

William D Clinger <will_at_ccs.neu.edu> writes:

>> Note that UTF-x with a BOM are stateful codecs because it must
>> remember whether it has written (on encoding) or read (on decoding)
>> the BOM at the beginning.
>
> With the draft R6RS, the efficient idiom for using a BOM to select
> the transcoder to use for buffered input of the entire contents of a
> port would run something like this:

Discussion would be easier if R6RS had provided a mechanism for
implementing codecs or transcoders, which would reveal the protocol
used to maintain transcoder state across invocations.

I meant that this usage:

  (let ((port (open-file-input-port filename))
        (transcoder (make-transcoder (utf-16-codec))))
    (let loop ()
      (let ((ch (get-char port transcoder)))
        (unless (eof-object? ch)
          (write-char ch)
          (loop)))))

can't work well for a hypothetical (utf-16-codec) which implements
UTF-16 Encoding Scheme, i.e. where an initial BOM selects the
endianness, defaulting to BE. It could work only if the port maintains
a transcoder-dependent state which is passed to transcoder invocations,
but then using different transcoders for different I/O operations
would mix up incompatible states.
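
A minimal sketch of the problem, written in Python purely for
illustration (it is not the draft R6RS API). The class
Utf16SchemeDecoder is a hypothetical name; it assumes the UTF-16
Encoding Scheme rule described above, where an initial BOM selects the
endianness and BE is the default. Decoding the same bytes in two calls
gives different results depending on whether the decoder state survives
between the calls:

```python
import codecs


class Utf16SchemeDecoder:
    """Hypothetical decoder for the UTF-16 encoding scheme: an initial BOM
    selects the endianness, and big-endian is assumed when there is none."""

    def __init__(self):
        self._head = b""      # bytes seen before the endianness is known
        self._inner = None    # endianness-specific decoder, chosen lazily

    def decode(self, data: bytes) -> str:
        if self._inner is None:
            self._head += data
            if len(self._head) < 2:
                return ""     # not enough input yet to look for a BOM
            data, self._head = self._head, b""
            if data[:2] == codecs.BOM_UTF16_LE:
                name, data = "utf-16-le", data[2:]
            elif data[:2] == codecs.BOM_UTF16_BE:
                name, data = "utf-16-be", data[2:]
            else:
                name = "utf-16-be"   # no BOM: default to big-endian
            self._inner = codecs.getincrementaldecoder(name)()
        return self._inner.decode(data)


# Little-endian "hi" preceded by a BOM, delivered in two pieces.
encoded = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
first, second = encoded[:4], encoded[4:]

# State kept across calls: the BOM seen in the first call still applies.
shared = Utf16SchemeDecoder()
with_state = shared.decode(first) + shared.decode(second)

# Fresh state per call: the second call sees no BOM and falls back to BE.
without_state = (Utf16SchemeDecoder().decode(first)
                 + Utf16SchemeDecoder().decode(second))

print(repr(with_state), repr(without_state))
```

Only the first variant recovers the text; the second misreads the
second piece because the endianness choice was lost with the state.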

The issue is similar to case mapping of strings: it's not sufficient
to map individual characters context-insensitively, even though it
works in common cases. A codec in general applies to the whole
sequence, rather than to individual characters.
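
The case-mapping analogy can be shown with a small Python example
(Python is used here only as a convenient illustration). It relies on
the Unicode final-sigma rule, which Python's str.lower() applies to
whole strings but cannot apply to isolated characters:

```python
# GREEK CAPITAL LETTER OMICRON followed by GREEK CAPITAL LETTER SIGMA.
word = "\u039f\u03a3"

# Mapping each character on its own: sigma always becomes U+03C3.
per_char = "".join(c.lower() for c in word)

# Mapping the whole string: a word-final sigma becomes U+03C2.
whole = word.lower()

print(per_char == whole)
```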

I claim that the interface of specifying a transcoder at individual
I/O operations is a bad idea, because it encourages coding style
which is incompatible with stateful encodings, and is inefficient.

Specifying the transcoder with a port is fine. In order to switch
encodings, there should be a way to obtain a port which filters an
existing port through a transcoder. In the case of input, switching
codecs freely (without knowing upfront how many bytes or characters
will be converted) is necessarily inefficient: in this case buffering
above the transcoder must be switched off so that as few characters as
possible are transcoded at a time. But at least it has well-defined
semantics whether or not the encoding is stateful: the state
is reset explicitly when the subport is created, and applies for a
whole sequence of I/O operations.
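
A rough sketch of such a filtering subport, again in Python for
illustration only. TranscodedReader and its methods are hypothetical
names, not part of any R6RS draft. The decoder state is created when
the subport is created, and input is pulled one byte at a time so that
nothing beyond the consumed characters is taken from the underlying
port, which is exactly the inefficiency mentioned above:

```python
import codecs
import io


class TranscodedReader:
    """Hypothetical subport: filters a binary stream through a decoder
    whose state is reset when the subport is created."""

    def __init__(self, raw, encoding):
        self.raw = raw
        self.decoder = codecs.getincrementaldecoder(encoding)()

    def read_char(self):
        # Unbuffered on purpose: read byte by byte until the decoder
        # produces a character, so no later bytes are consumed.
        # Assumes the codec never yields more than one character per byte.
        while True:
            byte = self.raw.read(1)
            if not byte:
                return ""            # end of input
            ch = self.decoder.decode(byte)
            if ch:
                return ch

    def read_line(self):
        chars = []
        while True:
            ch = self.read_char()
            if ch in ("", "\n"):
                return "".join(chars)
            chars.append(ch)


# One UTF-8 line followed by one Latin-1 line on the same binary port.
raw = io.BytesIO("żółw\n".encode("utf-8") + "café\n".encode("latin-1"))

line1 = TranscodedReader(raw, "utf-8").read_line()
line2 = TranscodedReader(raw, "latin-1").read_line()   # fresh state here

print(line1, line2)
```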

> With the default transcoder, you can expect the draft R6RS get-char
> to be about as fast as current implementations of read-char.

With a non-default transcoder though, e.g. one implemented using the
system's iconv(), I expect iconv() to be called for each character
separately (in order to support mixing text and binary I/O), which is
probably much less efficient than recoding a block at a time. And it's
unnecessary for the majority of programs, which don't mix text and
binary I/O.

Another issue: lookahead-char cannot be implemented in terms of
lookahead-u8. This means that ports must contain a particular
functionality (buffering with far lookahead) which is not exposed
in its full generality, but is used only in the limited way that
lookahead-char needs.
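
To illustrate why one byte of lookahead is not enough, here is a small
Python sketch assuming UTF-8 input. LookaheadBytes, utf8_length and
lookahead_char are hypothetical helpers written for this example; the
point is that peeking at a character needs a peek of up to four bytes,
which a single-byte lookahead cannot provide:

```python
import io


class LookaheadBytes:
    """Hypothetical binary source with far lookahead: peek(n) returns up
    to n bytes without consuming them."""

    def __init__(self, raw):
        self.raw = raw
        self.buf = b""

    def peek(self, n):
        while len(self.buf) < n:
            chunk = self.raw.read(n - len(self.buf))
            if not chunk:
                break                # end of input: return what we have
            self.buf += chunk
        return self.buf[:n]

    def read(self, n):
        data = self.peek(n)
        self.buf = self.buf[len(data):]
        return data


def utf8_length(lead):
    """Length in bytes of a UTF-8 sequence, judged from its first byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def lookahead_char(src):
    """Peek at the next character; needs more than one byte of lookahead."""
    first = src.peek(1)
    if not first:
        return ""
    return src.peek(utf8_length(first[0])).decode("utf-8")


src = LookaheadBytes(io.BytesIO("é!".encode("utf-8")))
one_byte = src.peek(1)          # only the lead byte of a two-byte sequence
peeked = lookahead_char(src)    # the whole character, still unconsumed
consumed = src.read(2)
after = lookahead_char(src)
```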

Is there a reference implementation of the I/O design?

>> It would be better to make text streams on top of binary streams,
>> and putting buffering once on the top instead of embedding a buffering
>> layer in each port. All intermediate conversions should be performed
>> a block at a time, the buffering layer switches to individual
>> characters.
>
> Couldn't an R6RS program do that by using get-bytes-all
> (or get-bytes-n, get-bytes-n!, or get-bytes-some) to
> read the binary data and then applying any binary-to-text
> conversion it wants?

It seems that R6RS provides no interface for building an input port
which is buffered above the transcoder, such that when get-line is
used to read contents, the transcoder is invoked once per large
block rather than once per character. Custom readers are designed
for the binary layers only.

The program could at most reimplement the whole I/O design
differently. In particular it must reimplement the get-line logic
of recognizing newlines.
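
What buffering above the transcoder could look like, as a hedged
Python sketch (BufferedTextReader and its decode_calls counter are
hypothetical, invented for this example). The decoder is called once
per block of bytes, and read_line works on the already decoded
character buffer:

```python
import codecs
import io


class BufferedTextReader:
    """Hypothetical text reader that buffers decoded characters, so the
    decoder runs once per block instead of once per character."""

    def __init__(self, raw, encoding, block_size=4096):
        self.raw = raw
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.block_size = block_size
        self.chars = ""          # decoded but not yet consumed characters
        self.eof = False
        self.decode_calls = 0    # how many times the decoder was invoked

    def _fill(self):
        block = self.raw.read(self.block_size)
        self.eof = not block
        self.decode_calls += 1
        self.chars += self.decoder.decode(block, final=self.eof)

    def read_line(self):
        # Newline recognition happens on characters, above the decoder.
        while "\n" not in self.chars and not self.eof:
            self._fill()
        line, sep, rest = self.chars.partition("\n")
        self.chars = rest
        return line + sep        # "" signals end of input


reader = BufferedTextReader(io.BytesIO(("line\n" * 1000).encode("utf-8")),
                            "utf-8")
lines = []
while True:
    line = reader.read_line()
    if not line:
        break
    lines.append(line)

print(len(lines), reader.decode_calls)
```

The number of decoder calls depends on the input size and block size,
not on the number of characters or lines read.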

Here are some aspects of the I/O design for my language (implemented):

Binary streams and text streams are distinguished.

Most transformations are designed to be performed a block at a time,
moving bytes or characters between buffers. Most kinds of layers
support only block I/O.

Typically the topmost layer provides buffering (of bytes or characters),
including reading a character or a line at a time, far lookahead,
unreading, and automatic flushing on writing (after each write or
after each line).

Newline recoding is done with a separate layer operating on the text
level, just above the recoding between bytes and characters.
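
The layering described above might be sketched as follows. This is
Python for illustration only, not the actual implementation in the
language mentioned; decode_blocks and recode_newlines are hypothetical
names. Each layer processes a block at a time, and the newline layer
sits on the text level just above the byte-to-character layer:

```python
import codecs


def decode_blocks(byte_blocks, encoding):
    """Layer turning blocks of bytes into blocks of characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for block in byte_blocks:
        text = decoder.decode(block)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)   # flush and check for truncation
    if tail:
        yield tail


def recode_newlines(text_blocks):
    """Layer turning CR LF into LF, one block of characters at a time."""
    pending_cr = False           # a CR at a block end may start a CR LF
    for block in text_blocks:
        if pending_cr:
            block = "\r" + block
            pending_cr = False
        if block.endswith("\r"):
            block, pending_cr = block[:-1], True
        out = block.replace("\r\n", "\n")
        if out:
            yield out
    if pending_cr:
        yield "\r"               # lone CR at end of input is kept as is


# Blocks split in the middle of a CR LF pair and of a UTF-8 sequence.
byte_blocks = [b"a\r", b"\nb\xc3", b"\xa9\r\n"]
result = "".join(recode_newlines(decode_blocks(byte_blocks, "utf-8")))
```

A buffering layer providing per-character and per-line reads would
then sit on top of these block-oriented layers.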

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak_at_knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

Received on Mon Oct 30 2006 - 18:39:39 UTC
