The design of specifying an I/O transcoder on each operation rules
out stateful codecs (like ISO-2022, or UTF-x with a BOM). The design
allowing arbitrary mixing of character and binary I/O rules out
buffering of transcoder calls, which makes character reading
unnecessarily inefficient.

A stateful codec must maintain state across separate calls to
get-char on the same port. There is no place to store the
transcoder-dependent state when the transcoder is specified on each
get-char call separately. This means that either a transcoder attached
to a port follows different rules than one specified on an I/O call
(in particular, different sets of transcoders are valid in the two
contexts), or stateful codecs are not supported at all.
Note that UTF-x with a BOM is a stateful codec because it must
remember whether it has already written (on encoding) or read (on
decoding) the BOM at the beginning of the stream.
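
To illustrate the BOM point, here is a minimal sketch in Python rather
than Scheme, using the standard codecs module. The names per_stream,
per_call, decoded_stream and decoded_calls are mine, chosen only for
this example. One codec object kept for the whole stream handles the
BOM correctly; a fresh codec object per call, which is what naming the
transcoder on each call amounts to, does not.

```python
import codecs

BOM = codecs.BOM_UTF16  # BOM in the platform's native byte order

# Encoding with one encoder kept for the whole stream: the BOM is written once.
encoder = codecs.getincrementalencoder("utf-16")()
per_stream = encoder.encode("h") + encoder.encode("i")

# Encoding with a fresh encoder on every call: no state survives between
# calls, so every call believes it is at the start of the stream.
per_call = b"".join(
    codecs.getincrementalencoder("utf-16")().encode(c) for c in "hi"
)

# Decoding has the same problem. Big-endian input is fixed explicitly so
# that the example does not depend on the platform.
data = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

decoder = codecs.getincrementaldecoder("utf-16")()
decoded_stream = "".join(decoder.decode(chunk) for chunk in chunks)

decoded_calls = []
for chunk in chunks:
    try:
        fresh = codecs.getincrementaldecoder("utf-16")()
        decoded_calls.append(fresh.decode(chunk))
    except UnicodeError:
        # A fresh decoder that sees no BOM cannot know the byte order.
        decoded_calls.append(None)
```

With the encoder kept per stream the output contains a single BOM,
while the per-call variant emits one BOM per character. On the decoding
side only the per-stream decoder recovers "hi".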

Since it's allowed to call get-char, get-line etc. and then get-u8
or get-bytes-n, with the bytes being those immediately following
the encoded form of the characters, decoding must be performed one
character at a time: the decision when to stop depends on the
resulting character sequence. This is slow because the cost of
transcoder dispatch and of setting up loops cannot be amortized over
many characters. This is especially bad for an externally implemented
codec like iconv, where the cost of setting up buffers for talking
to C may be large relative to decoding a single character. Buffering
should be the top layer, so that it amortizes the cost of as many
layers as possible.
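
The constraint can be sketched in Python (again not Scheme; get_char
and get_line here are hypothetical stand-ins for the proposed
procedures, and io.BytesIO stands in for a port). To leave the binary
position exactly after the line, the decoder has to be fed one byte at
a time and set up again for every character. Python's own block-wise
io.TextIOWrapper is shown for contrast: it reads ahead, so the bytes
after the line are no longer available to a binary read.

```python
import codecs
import io


def get_char(port, encoding="utf-8"):
    """Decode exactly one character, consuming no byte beyond it."""
    decoder = codecs.getincrementaldecoder(encoding)()  # set up on every call
    while True:
        byte = port.read(1)
        if not byte:
            return decoder.decode(b"", final=True)  # "" at end of input
        char = decoder.decode(byte)
        if char:
            return char


def get_line(port):
    """Read characters up to and including a newline; return the line without it."""
    chars = []
    while True:
        c = get_char(port)
        if c in ("", "\n"):
            return "".join(chars)
        chars.append(c)


payload = "héllo\n".encode("utf-8") + b"\x01\x02"

# Character-at-a-time decoding: the binary data after the line is intact.
port = io.BytesIO(payload)
line = get_line(port)
rest = port.read()

# Block-at-a-time decoding: the wrapper has already consumed everything.
raw = io.BytesIO(payload)
wrapped = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
block_line = wrapped.readline().rstrip("\n")
rest_after_block = raw.read()
```

Both variants return the same line, but only the slow one leaves the
two trailing bytes where a following get-u8 would expect them.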

The combined effect of both flaws is even worse. If an iconv transcoder
is specified on each I/O call, then the C library performs a lookup of
the encoding names on every call (the iconv interface doesn't separate
the lookup from starting an instance of a stateful transcoder).

It would be better to build text streams on top of binary streams
and to put buffering once at the top, instead of embedding a buffering
layer in each port. All intermediate conversions should be performed
a block at a time; only the buffering layer switches to individual
characters.
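
A rough sketch of that layering, in Python and under my own naming
(TextStream, block_size and get_char are hypothetical, not part of any
proposal): the decoder is created once per stream, so it keeps its
state and is looked up only once, it converts a block at a time, and
only the topmost buffer hands out individual characters.

```python
import codecs
import io


class TextStream:
    """Hypothetical text layer over a binary stream, buffered at the top."""

    def __init__(self, raw, encoding, block_size=4096):
        self._raw = raw
        # Created once per stream: codec lookup and state live here.
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self._block_size = block_size
        self._buffer = ""   # decoded characters not yet handed out
        self._pos = 0
        self._eof = False

    def _fill(self):
        # Decode whole blocks until some characters are available.
        while self._pos >= len(self._buffer) and not self._eof:
            block = self._raw.read(self._block_size)
            self._eof = not block
            self._buffer = self._decoder.decode(block, final=self._eof)
            self._pos = 0

    def get_char(self):
        """Return the next character, or "" at end of input."""
        if self._pos >= len(self._buffer):
            self._fill()
        if self._pos >= len(self._buffer):
            return ""
        c = self._buffer[self._pos]
        self._pos += 1
        return c


# A deliberately tiny block size, so that characters and the BOM are
# split across block boundaries.
stream = TextStream(io.BytesIO("héllo".encode("utf-16")), "utf-16", block_size=3)
text = "".join(iter(stream.get_char, ""))
```

The price is the one argued for above: once the text layer has read
ahead, binary reads cannot be freely interleaved on the same
underlying stream.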
--
__("< Marcin Kowalczyk
\__/ qrczak_at_knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Received on Mon Oct 30 2006 - 07:39:36 UTC