[r6rs-discuss] Stateful codecs and inefficient transcoding
From: William D Clinger <will_at_ccs.neu.edu>
Subject: [r6rs-discuss] Stateful codecs and inefficient transcoding
Date: Mon, 30 Oct 2006 14:59:14 -0500
> (define (guess-the-transcoder port)
>   (let* ((zwnbsp #\xfeff)
>          (utf-8 (transcoder (codec (utf-8-codec))))
>          (utf-16le (transcoder (codec (utf-16le-codec))))
>          (utf-16be (transcoder (codec (utf-16be-codec))))
>          (utf-32le (transcoder (codec (utf-32le-codec))))
>          (utf-32be (transcoder (codec (utf-32be-codec))))
>          (c8 (lookahead-char port utf-8))
>          (c16le (lookahead-char port utf-16le))
>          (c16be (lookahead-char port utf-16be))
>          (c32le (lookahead-char port utf-32le))
>          (c32be (lookahead-char port utf-32be)))
>     (cond ((char=? c8 zwnbsp)
>            utf-8)
>           ((char=? c16le zwnbsp)
>            utf-16le)
>           ((char=? c16be zwnbsp)
>            utf-16be)
>           ((char=? c32le zwnbsp)
>            utf-32le)
>           ((char=? c32be zwnbsp)
>            utf-32be)
>           (else
>            utf-8))))
Thank you for the illustrative example. I have a few thoughts
about this kind of "transient transcoder" usage.
(1) As noted in the lookahead-char entry, this kind of coding
requires some amount of lookahead from the port. In fact,
it requires a potentially unlimited amount: although highly
unlikely, an iso2022 encoding *can* have an arbitrary number
of escape sequences before it hits the 'real character'.
I wonder whether the designers intend to require unlimited
lookahead from a conforming implementation of a port (the
standard doesn't specify iso2022 as a standard codec, but
it is a *must* to support if you want to write a practical
application that handles emails).
Or may a conforming implementation support only typical
cases (an escape sequence always followed by a real
character) and report an implementation limitation
violation otherwise? I think that is a plausible strategy,
but if that's the case I feel it is better to add a note
something like
"the implementation may limit the amount of lookahead
from the port, and should raise an error if the input stream
contains data that requires more lookahead than that
limit."
(2) This kind of transcoder guessing only works for very
limited cases in practice. In most cases you have to
look ahead quite a few octets to find out the input encoding.
So I feel this example of using a transient transcoder with
lookahead-char shows not much benefit (you can work on
a lower-level reader to handle this case, as well as more
complicated guessing). Are there any other examples that
show the usefulness of the transient transcoder?
(3) Output functions such as put-char and put-string also
accept a transient transcoder. Since there's no place
to keep the encoding state, they must produce output
that is "self consistent", i.e. that always begins and ends
in the default state:
    (begin
      (put-char #\x3042 iso-2022-jp-transcoder)
      (put-char #\x3043 iso-2022-jp-transcoder))
    => produces octet sequence:
       [1b 24 42 24 22 1b 28 42 1b 24 42 24 23 1b 28 42]
whereas:
    (put-string "\x3042;\x3043;" iso-2022-jp-transcoder)
    => produces octet sequence:
       [1b 24 42 24 22 24 23 1b 28 42]
Or, a worse example:
    (begin
      (put-char #\x304b euc-jp-transcoder)
      (put-char #\x309a euc-jp-transcoder))
    => error (because EUC-JP has no code point for U+309a)
       or produces a close but inaccurate approximation [a4 ab a1 ac]
whereas:
    (put-string "\x304b;\x309a;" euc-jp-transcoder)
    => produces octet sequence [a4 f7]
Aren't these confusing? In particular, if the transcoder is
passed as a parameter, you cannot reliably interchange put-char
and put-string (unless you know you live in a Unicode-only world).
Given this complexity, I don't see much benefit in having this
kind of transient transcoder.
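
The difference between the per-call and the streamed output above
can be reproduced with a minimal sketch. It assumes a toy encoder,
not a real ISO-2022-JP codec: the names encode-chunk, esc-to-jis and
esc-to-ascii are hypothetical, and only ASCII and hiragana
U+3041..U+3093 (JIS X 0208 row 4) are handled. Each call to
encode-chunk stands for one I/O call with a transient transcoder,
so it must start and end in the ASCII state.

```scheme
(import (rnrs))

;; Escape sequences of the toy encoder (hypothetical names).
(define esc-to-jis   '(#x1b #x24 #x42))   ; ESC $ B : switch to JIS X 0208
(define esc-to-ascii '(#x1b #x28 #x42))   ; ESC ( B : switch back to ASCII

(define (hiragana? c)
  (char<=? #\x3041 c #\x3093))

;; Encode one self-contained chunk into a list of octets.
;; The state starts as ASCII and is forced back to ASCII at the end,
;; because nothing survives between calls.
;; Characters other than hiragana are assumed to be ASCII.
(define (encode-chunk str)
  (let loop ((chars (string->list str)) (state 'ascii) (acc '()))
    (cond
      ((null? chars)
       (reverse (if (eq? state 'jis)
                    (append (reverse esc-to-ascii) acc)
                    acc)))
      ((hiragana? (car chars))
       (let* ((low (+ #x21 (- (char->integer (car chars)) #x3041)))
              (acc (if (eq? state 'jis)
                       acc
                       (append (reverse esc-to-jis) acc))))
         (loop (cdr chars) 'jis (cons low (cons #x24 acc)))))
      (else
       (let ((acc (if (eq? state 'ascii)
                      acc
                      (append (reverse esc-to-ascii) acc))))
         (loop (cdr chars) 'ascii
               (cons (char->integer (car chars)) acc)))))))

;; One call per character: every call pays for both escape sequences.
(define per-char
  (append (encode-chunk "\x3042;") (encode-chunk "\x3043;")))

;; One call for the whole string: the escapes are paid once.
(define whole (encode-chunk "\x3042;\x3043;"))
```

Here per-char has 16 octets and whole has 10, the same two
sequences as in the iso-2022-jp example above. Both decode to the
same text, but they are not the same octet stream.
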
I feel the complexity comes from the fact that transcoding
is inherently a streaming operation, but the transient
transcoder forces such a stream to be cut at each I/O
procedure call. I'd rather have:
(a) an explicit lookahead operation on ports, like
lookahead-bytes-n.
(b) transcoders defined as a special kind of layered port;
an input transcoding port reads octets from a source port
and produces characters. An output transcoding port
takes characters and puts octets to the drain port.
(With lookahead-bytes-n and the reader/writer facility, you
can implement such transcoding ports in Scheme, I think,
though the implementation may provide a more efficient one
that uses the implementation-specific internal structure of ports).
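
As a rough illustration of (a) and (b) together, here is a sketch
written against the port procedures of the final R6RS library
(open-bytevector-input-port, get-bytevector-n, transcoded-port),
not the draft API discussed in this thread. The names prefix?,
sniff-bom and open-decoding-port are hypothetical. The guess is made
on raw octets, and the 4-octet BOMs are tested first because the
UTF-32LE BOM (ff fe 00 00) begins with the UTF-16LE BOM (ff fe).
Only the UTF-8 case is wired to a layered port here.

```scheme
(import (rnrs))

;; True if the octets in the list BOM are a prefix of bytevector BV.
(define (prefix? bom bv)
  (and (>= (bytevector-length bv) (length bom))
       (let loop ((i 0) (bom bom))
         (or (null? bom)
             (and (= (car bom) (bytevector-u8-ref bv i))
                  (loop (+ i 1) (cdr bom)))))))

;; Guess the encoding from leading octets only.
;; Returns two values: an encoding name and the BOM length in octets.
(define (sniff-bom bv)
  (cond ((prefix? '(#x00 #x00 #xfe #xff) bv) (values 'utf-32be 4))
        ((prefix? '(#xff #xfe #x00 #x00) bv) (values 'utf-32le 4))
        ((prefix? '(#xef #xbb #xbf) bv)      (values 'utf-8 3))
        ((prefix? '(#xfe #xff) bv)           (values 'utf-16be 2))
        ((prefix? '(#xff #xfe) bv)           (values 'utf-16le 2))
        (else                                (values 'utf-8 0))))

;; Layer a textual port over a binary source port once the
;; encoding is known. Other encodings are rejected in this sketch.
(define (open-decoding-port bv)
  (let-values (((name bom-length) (sniff-bom bv)))
    (unless (eq? name 'utf-8)
      (error 'open-decoding-port "encoding not handled in this sketch" name))
    (let ((source (open-bytevector-input-port bv)))
      (when (> bom-length 0)
        (get-bytevector-n source bom-length)) ; discard the BOM octets
      (transcoded-port source (make-transcoder (utf-8-codec))))))

;; Usage: the decoding state lives in the layered port, so get-char
;; and get-string-all can be mixed freely on it.
(define text
  (get-string-all
   (open-decoding-port
    (u8-list->bytevector '(#xef #xbb #xbf #x68 #x69)))))
```

Since the encoding state belongs to the port rather than to each
call, reading character by character and reading a whole string see
the same stream.
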
--shiro
Received on Mon Oct 30 2006 - 19:47:54 UTC