[r6rs-discuss] Stateful codecs and inefficient transcoding from William D Clinger on 2006-10-31 (r6rs-discuss.mbox)

From: William D Clinger <will>
Date: Tue Oct 31 00:51:06 2006

I am posting this as an individual member of the Scheme
community. I am not speaking for the R6RS editors.

Marcin 'Qrczak' Kowalczyk wrote:
> Discussion would be easier if R6RS had provided a mechanism for
> implementing codecs or transcoders, which would reveal the protocol
> used to maintain transcoder state across invocations.

The draft R6RS does not require or envision any
mutable state in transcoders, nor does it provide or
require any means for maintaining mutable transcoder
state across invocations of textual i/o procedures.

> I meant that this usage: [...] can't work well for a
> hypothetical (utf-16-codec) which implements UTF-16
> Encoding Scheme, i.e. where an initial BOM selects the
> endianness, defaulting to BE.

True.

> It could work only if the port maintains
> a transcoder-dependent state which is passed to transcoder invocations,
> but then using different transcoders for different I/O operations
> would mix up incompatible states.

The ports of the draft R6RS *do* maintain a transcoder
as part of their state, and this transcoder is used
implicitly by most procedures in (r6rs io simple).
That transcoder can be specified by the following
procedures:

    open-file-input-port
    open-bytes-input-port
    open-string-input-port
    open-file-output-port
    call-with-bytes-output-port
    call-with-string-output-port
    open-file-input/output-port

Although the draft R6RS does not have your hypothetical
utf-16-codec that relies on an initial BOM to select
the endianness, I believe the draft R6RS allows that
hypothetical codec as an extension. That extension
could be implemented efficiently when it is specified
as an argument to any of the first three procedures
listed above: the implementation could peek at the
first two bytes, decide whether to use UTF-16BE or
UTF-16LE, and could install one of those two as the
transcoder associated with the port.
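
To make the peek-and-install idea concrete, here is a
minimal sketch in Python rather than Scheme. The names
pick_utf16_codec and decode_utf16 are hypothetical and
are not part of the draft R6RS; the BE default follows
the UTF-16 Encoding Scheme as Marcin described it.

```python
import codecs

def pick_utf16_codec(first_two: bytes, default: str = "utf-16-be"):
    """Return (codec_name, bom_length) for a stream's first two bytes.

    Hypothetical helper: a BOM selects the byte order; with no BOM
    we fall back to the default (big-endian here, by assumption).
    """
    if first_two[:2] == codecs.BOM_UTF16_BE:   # b"\xfe\xff"
        return "utf-16-be", 2
    if first_two[:2] == codecs.BOM_UTF16_LE:   # b"\xff\xfe"
        return "utf-16-le", 2
    return default, 0

def decode_utf16(data: bytes) -> str:
    """Peek once, then decode everything with the fixed-endian codec."""
    codec, skip = pick_utf16_codec(data[:2])
    return data[skip:].decode(codec)
```

After the peek, the port would carry only a stateless
fixed-endian transcoder, so no mutable transcoder state
is needed for later reads.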

A problem with this extension becomes apparent when
the hypothetical transcoder is supplied as an argument
to one of the last four procedures listed above, but
that problem could be resolved by treating (utf-16-codec)
the same (in that particular context) as either
(utf-16be-codec) or (utf-16le-codec), perhaps according
to some platform-specific default.

Another problem becomes apparent when the hypothetical
codec is supplied as an explicit argument to a textual
i/o procedure such as get-char or put-char. Again,
the ambiguity could be resolved somewhat arbitrarily,
or the implementation might resolve the ambiguity by
preserving some memory of the first two bytes within
the port.

As a general principle, I believe implementations that
make such extensions become responsible for providing
a reasonable semantics for the extensions.

> I claim that the interface of specifying a transcoder at individual
> I/O operations is a bad idea, because it encourages coding style
> which is incompatible with stateful encodings, and is inefficient.

I don't like it much myself, but not for the two
reasons you gave. For one thing, I harbor a prejudice
against stateful encodings; more importantly, I have
investigated the efficiency of the proposed interface,
and I know it can be implemented efficiently for the
specific transcoders envisioned by the draft R6RS.

In my opinion---and I am not speaking for the editors
here or elsewhere---the draft R6RS provides the optional
transcoder arguments for two reasons:

  * some file formats use different textual encodings
    at different places within the file

  * omitting or specifying the optional transcoder is
    more efficient, in the common case where a compiler
    can determine which transcoder is to be used, than
    having each operation fetch the transcoder out of
    a port object

> With a non-default transcoder though, e.g. implemented using system's
> iconv(), I expect iconv() to be called for each character separately
> (in order to support mixing text and binary I/O), which is probably
> much less efficient than recoding a block at a time. And it's
> unnecessary for the majority of programs which don't mix text and
> binary I/O.

I do not expect implementations to call iconv for any
of the transcoders that would be required by the draft
R6RS, and I think it would be unwise for implementations
to provide other transcoders that are implemented by
calling iconv separately for each character. I think
it would make a lot more sense for implementations to
provide iconv in some library, and to encourage programmers
to compose buffered binary i/o with iconv.
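
As a rough illustration of composing buffered binary
i/o with a block-oriented converter, here is a sketch
in Python. Python's incremental decoders stand in for
iconv here; that substitution, the name decode_blocks,
and the block size are all assumptions of the sketch.

```python
import codecs
import io

def decode_blocks(binary_stream, encoding, block_size=4096):
    """Yield decoded text one block at a time from a binary stream.

    The incremental decoder keeps any partial multi-byte sequence
    between calls, so a block boundary may split a character.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        block = binary_stream.read(block_size)
        if not block:
            break
        text = decoder.decode(block)
        if text:
            yield text
    # Flush; raises if the stream ended in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# Usage: a tiny block size forces characters to straddle blocks.
data = "żółw €".encode("utf-8")
text = "".join(decode_blocks(io.BytesIO(data), "utf-8", block_size=3))
```

The converter is invoked once per block rather than
once per character, and the buffering is done by the
program itself.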

By the way, I don't think the buffer modes of the draft
R6RS are very meaningful or useful. When I say buffered,
I mean buffering performed by the program itself, not by
the implementation in response to those buffer modes.

> Another issue: lookahead-char is unimplementable in terms of
> lookahead-u8.

True. IIRC, the draft R6RS requires ports to maintain
a four-byte lookahead buffer.
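
To see why one byte of lookahead is not enough: UTF-8
encodes a character in up to four bytes, so peeking at
a character means peeking at up to four bytes. A sketch
in Python follows; lookahead_char and
utf8_sequence_length are hypothetical names, and
BufferedReader.peek stands in for the port's lookahead
buffer.

```python
import io

def utf8_sequence_length(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte: %#x" % lead)

def lookahead_char(buffered: io.BufferedReader):
    """Return the next character without consuming it, or None at EOF."""
    head = buffered.peek(4)[:4]      # peek does not advance the stream
    if not head:
        return None
    n = utf8_sequence_length(head[0])
    return head[:n].decode("utf-8")  # raises if fewer than n bytes were peeked
```

Note that peek may return fewer bytes than requested
when the internal buffer is nearly drained, so a real
implementation would have to refill before decoding.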

> Is there a reference implementation of the I/O design?

Not yet. The proposed (r6rs io ports) and (r6rs io simple)
libraries are very similar to extant extensions of the
R5RS i/o system, however, and the Scheme community has
a lot of experience with those implementations.

> It seems that R6RS provides no interface for building an input port
> which is buffered above the transcoder, such that when get-line is
> used to read contents, the transcoder is invoked once per large
> block rather than once per character. Custom readers are designed
> for the binary layers only.

If I understand you correctly, that is true.

> Here are some aspects of the I/O design for my language (implemented):
>
> Binary streams and text streams are distinguished.

That is a common design choice, but it is a limiting
choice. There are several important file formats,
e.g. MPEG, that contain both binary and textual data.
Furthermore I am told that some important file formats,
e.g. XML, use several different textual encodings.

> Most transformations are designed to be performed a block at a time,
> moving bytes or characters between buffers. Most kinds of layers
> support only block I/O.

In my view, the draft R6RS supports this model, though
not as conveniently as one might like.

> Newline recoding is done with a separate layer operating on the text
> level, just above the recoding between bytes and characters.

Presumably this layer would not be implemented by a
separate pass, since that would be inefficient.

Will
Received on Tue Oct 31 2006 - 00:51:00 UTC
