William D Clinger <will_at_ccs.neu.edu> writes:
>> Note that UTF-x with a BOM are stateful codecs because it must
>> remember whether it has written (on encoding) or read (on decoding)
>> the BOM at the beginning.
>
> With the draft R6RS, the efficient idiom for using a BOM to select
> the transcoder to use for buffered input of the entire contents of a
> port would run something like this:

Discussion would be easier if R6RS had provided a mechanism for
implementing codecs or transcoders, which would reveal the protocol
used to maintain transcoder state across invocations.

I meant that this usage:

  ;; draft R6RS style: the transcoder is passed to each get-char call
  (let ((port (open-file-input-port filename))
        (transcoder (make-transcoder (utf-16-codec))))
    (let loop ()
      (let ((ch (get-char port transcoder)))
        (unless (eof-object? ch)
          (write-char ch)
          (loop)))))

can't work well for a hypothetical (utf-16-codec) which implements
the UTF-16 encoding scheme, i.e. where an initial BOM selects the
endianness, defaulting to BE. It could work only if the port maintained
transcoder-dependent state and passed it to each transcoder invocation,
but then using different transcoders for different I/O operations
would mix up incompatible states.

The issue is similar to case mapping of strings: it's not sufficient
to map individual characters context-insensitively, even though it
works in common cases. A codec in general applies to the whole
sequence, rather than to individual characters.
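
As an illustration only, the state problem can be reproduced in a few lines of Python. The `codecs` module stands in for R6RS transcoders here, and `decode_chunk_stateless` is a hypothetical helper written for this sketch, not part of any real API. The same little-endian UTF-16 data is decoded two bytes per call, once by a function that forgets everything between calls and defaults to BE when it sees no BOM, and once by a decoder object that remembers the BOM.

```python
import codecs


def decode_chunk_stateless(chunk: bytes) -> str:
    """Decode one chunk as UTF-16 with no memory of earlier chunks (hypothetical helper)."""
    if chunk.startswith(codecs.BOM_UTF16_LE):
        return chunk[2:].decode("utf-16-le")
    if chunk.startswith(codecs.BOM_UTF16_BE):
        return chunk[2:].decode("utf-16-be")
    # No BOM in *this* chunk, so fall back to the default byte order (BE).
    return chunk.decode("utf-16-be")


# A little-endian BOM followed by "Hi", split into one code unit per chunk.
data = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

# Per-call decoding: the BOM is seen, then forgotten before the next call.
stateless = "".join(decode_chunk_stateless(c) for c in chunks)

# One decoder object for the whole sequence: the BOM is remembered.
decoder = codecs.getincrementaldecoder("utf-16")()
stateful = "".join(decoder.decode(c) for c in chunks)
stateful += decoder.decode(b"", final=True)

print(repr(stateless), repr(stateful))
```

The stateless variant reads every code unit after the BOM with the wrong byte order, which is the kind of breakage a per-operation transcoder argument invites.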

I claim that the interface of specifying a transcoder at individual
I/O operations is a bad idea, because it encourages a coding style
that is incompatible with stateful encodings and is inefficient.

Specifying the transcoder with a port is fine. In order to switch
encodings, there should be a way to obtain a port which filters an
existing port through a transcoder. In the case of input, switching
codecs freely (without knowing upfront how many bytes or characters
will be converted) is necessarily inefficient: in this case buffering
above the transcoder must be switched off so that as few characters as
possible are transcoded at a time. But at least it has well-defined
semantics whether or not the encoding is stateful: the state is reset
explicitly when the subport is created, and applies to a whole
sequence of I/O operations.
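
A rough Python analogue of such a filtering port, offered as a sketch rather than as a proposal for the R6RS API: `io.TextIOWrapper` plays the role of the subport, and the 4-byte `HDR1` header is an invented format used only for this example. The decoder state is created together with the wrapper and lives exactly as long as the wrapper does.

```python
import io

# Invented layout for the example: a 4-byte binary header, then UTF-16 text.
# Encoding with "utf-16" writes a BOM in the machine's native byte order.
payload = b"HDR1" + "text\n".encode("utf-16")

raw = io.BytesIO(payload)

# Binary I/O on the underlying port.
header = raw.read(4)

# Obtain a text port that filters the rest of the binary port through a
# decoder. The decoder starts in its initial state here, so it will see
# the BOM as the first thing in its input.
text_port = io.TextIOWrapper(raw, encoding="utf-16")
body = text_port.read()

print(header, repr(body))
```

Note that the wrapper reads ahead from `raw` in blocks, so returning to binary reads on `raw` afterwards would not be reliable. That is the trade-off described above: free switching requires turning off buffering above the transcoder.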

> With the default transcoder, you can expect the draft R6RS get-char
> to be about as fast as current implementations of read-char.

With a non-default transcoder though, e.g. one implemented using the
system's iconv(), I expect iconv() to be called for each character
separately (in order to support mixing text and binary I/O), which is
probably much less efficient than recoding a block at a time. And it's
unnecessary for the majority of programs, which don't mix text and
binary I/O.

Another issue: lookahead-char cannot be implemented in terms of
lookahead-u8. This means that ports must contain buffering with a far
lookahead, a capability which is not exposed in the full generality
the port is able to provide, but is used only in the limited way
lookahead-char needs.
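
The point can be sketched in Python with a hypothetical `LookaheadPort` class (all names here are invented for the illustration, and UTF-8 is assumed as the encoding). A one-byte lookahead sees only the lead byte of a multi-byte character, so peeking at a character needs a lookahead of up to four bytes, which the port has to be able to buffer.

```python
import io


class LookaheadPort:
    """Hypothetical binary port that can look ahead n bytes without consuming them."""

    def __init__(self, raw):
        self._raw = raw
        self._pending = b""

    def lookahead(self, n):
        # Pull bytes from the underlying stream until n are buffered or EOF.
        while len(self._pending) < n:
            more = self._raw.read(n - len(self._pending))
            if not more:
                break
            self._pending += more
        return self._pending[:n]

    def read(self, n):
        data = self.lookahead(n)
        self._pending = self._pending[len(data):]
        return data


def utf8_length(first_byte):
    """Length of a UTF-8 sequence, judged from its lead byte (valid input assumed)."""
    if first_byte < 0x80:
        return 1
    if first_byte < 0xE0:
        return 2
    if first_byte < 0xF0:
        return 3
    return 4


def lookahead_char(port):
    """Peek at the next character; returns "" at end of input."""
    head = port.lookahead(1)
    if not head:
        return ""
    return port.lookahead(utf8_length(head[0])).decode("utf-8")


port = LookaheadPort(io.BytesIO("żx".encode("utf-8")))
one_byte = port.lookahead(1)      # only the lead byte of "ż"
first_char = lookahead_char(port)  # needs a two-byte lookahead
```

Here `lookahead(1)` alone cannot produce the character; `lookahead_char` works only because the port offers the more general n-byte operation.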

Is there a reference implementation of the I/O design?

>> It would be better to make text streams on top of binary streams,
>> and putting buffering once on the top instead of embedding a buffering
>> layer in each port. All intermediate conversions should be performed
>> a block at a time, the buffering layer switches to individual
>> characters.
>
> Couldn't an R6RS program do that by using get-bytes-all
> (or get-bytes-n, get-bytes-n!, or get-bytes-some) to
> read the binary data and then applying any binary-to-text
> conversion it wants?

It seems that R6RS provides no interface for building an input port
which is buffered above the transcoder, such that when get-line is
used to read contents, the transcoder is invoked once per large
block rather than once per character. Custom readers are designed
for the binary layers only.

The program could at most reimplement the whole I/O design
differently. In particular it must reimplement the get-line logic
of recognizing newlines.
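
To show what "buffered above the transcoder" amounts to, here is a minimal Python sketch. `BlockTextReader` and its `decode_calls` counter are hypothetical names for this illustration, and only LF is recognized as a line terminator. The decoder is called once per block of bytes, and line recognition works on the already decoded character buffer.

```python
import codecs
import io


class BlockTextReader:
    """Hypothetical text reader that buffers characters above the decoder."""

    def __init__(self, raw, encoding, block_size=4096):
        self._raw = raw
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self._block_size = block_size
        self._chars = ""        # decoded characters not yet handed out
        self._eof = False
        self.decode_calls = 0   # instrumentation for the example only

    def _fill(self):
        # One decoder call per block of bytes, never one per character.
        block = self._raw.read(self._block_size)
        self._eof = not block
        self._chars += self._decoder.decode(block, final=self._eof)
        self.decode_calls += 1

    def get_line(self):
        """Return the next line including its newline, or "" at end of input."""
        while "\n" not in self._chars and not self._eof:
            self._fill()
        line, sep, rest = self._chars.partition("\n")
        self._chars = rest
        return line + sep


raw = io.BytesIO("alpha\nbeta\ngamma".encode("utf-16"))
reader = BlockTextReader(raw, "utf-16")
lines = [reader.get_line() for _ in range(4)]
print(lines, reader.decode_calls)
```

Three lines are read with two decoder calls (one for the data, one to flush at end of input), and the newline logic had to be written by hand, as noted above.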

Here are some aspects of the I/O design for my language (implemented):

Binary streams and text streams are distinguished.

Most transformations are designed to be performed a block at a time,
moving bytes or characters between buffers. Most kinds of layers
support only block I/O.

Typically the topmost layer provides buffering (of bytes or characters),
including reading a character or a line at a time, far lookahead,
unreading, and automatic flushing on writing (after each write or
after each line).

Newline recoding is done with a separate layer operating on the text
level, just above the recoding between bytes and characters.
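
The layering described above can be approximated with Python generators. This is a sketch under stated assumptions, not the implementation from my language: the function names are invented, only CRLF-to-LF recoding is handled, and the topmost layer only splits lines. Each lower layer sees whole blocks; only the top layer deals with smaller units.

```python
import codecs


def decode_blocks(byte_blocks, encoding):
    """Bytes-to-characters layer: one decoder for the whole stream, block at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for block in byte_blocks:
        yield decoder.decode(block)
    yield decoder.decode(b"", final=True)


def recode_newlines(text_blocks):
    """Text-level layer: CRLF -> LF, holding back a CR that ends a block."""
    held = ""
    for block in text_blocks:
        block = held + block
        held = ""
        if block.endswith("\r"):
            # This CR may be the first half of a CRLF split across blocks.
            block, held = block[:-1], "\r"
        yield block.replace("\r\n", "\n")
    yield held  # a lone CR at end of input is passed through unchanged


def lines(text_blocks):
    """Topmost layer: buffers characters and hands out one line at a time."""
    pending = ""
    for block in text_blocks:
        pending += block
        *complete, pending = pending.split("\n")
        for line in complete:
            yield line + "\n"
    if pending:
        yield pending


data = "one\r\ntwo\r\nthree".encode("utf-8")
byte_blocks = [data[i:i + 4] for i in range(0, len(data), 4)]  # splits a CRLF
result = list(lines(recode_newlines(decode_blocks(byte_blocks, "utf-8"))))
print(result)
```

The 4-byte block size is chosen so that the first CRLF straddles a block boundary, which is the case the newline layer has to keep state for.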
--
__("< Marcin Kowalczyk
\__/ qrczak_at_knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Received on Mon Oct 30 2006 - 18:39:39 UTC