[r6rs-discuss] Stateful codecs and inefficient transcoding
I am posting this as an individual member of the Scheme
community. I am not speaking for the R6RS editors.
Per Bothner quoting me:
> > In my previous message, I explained:
> >
> > * how character-at-a-time decoding with the default
> > transcoder would be only twice as slow as C's getc,
> > while allowing any get-char to be followed by a
> > get-u8
>
> No, you *stated* that. If there was any supporting explanation
> or evidence, I can't see it.
I apologize for being so terse. In this message,
I will try to err in the other direction.
> Are you assuming the default transcoder is one of the standard codecs
> and so would not need to use an external library like iconv?
By "default transcoder", I meant the transcoder that
an implementation uses for procedures like get-char
when no explicit transcoder argument is given.
According to draft R6RS section 15.3.5, paragraph 1,
the default transcoder *must* be "UTF-8 with a
platform-specific end-of-line convention."
That may well be unreasonable, but it is what the
draft says.
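In the terms the draft itself uses (and assuming its
constructors behave as in later drafts of R6RS), that
default amounts to something like this sketch:

    (define default-transcoder
      (make-transcoder (utf-8-codec)         ; UTF-8, per 15.3.5
                       (native-eol-style)))  ; platform eol convention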
> Or are you referring to implementation experience?
Yes.
If the default transcoder were for Latin-1 with the lf
eol-style, then the implementation of get-char would
be *exactly* the same as the implementation of read-char
in the current development version of Larceny. Since
the default transcoder is for UTF-8, the implementation
of get-char would be *almost* exactly the same as the
current implementation of read-char.
On the most common path, for ASCII characters, the
code for UTF-8 and the lf eol-style would add only
two machine instructions (a compare and a conditional
branch) to the code for Latin-1 with the lf eol-style.
Even the uncommon paths would add only a few more
instructions, so using UTF-8 as the default transcoding
should slow things down by only 2% or so.
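Concretely, the fast path has roughly this shape.
This is a sketch, not Larceny's actual code, and
decode-multibyte is a hypothetical name for the
uncommon path:

    ;; Decode one character from a byte buffer, returning
    ;; the character and the index of the next byte.
    (define (decode-one buf i)
      (let ((b (bytevector-u8-ref buf i)))
        (if (< b #x80)                 ; the extra compare and branch
            (values (integer->char b)  ; ASCII: same as Latin-1 path
                    (+ i 1))
            (decode-multibyte buf i)))) ; hypothetical slow path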
A 2% slowdown is too small to measure accurately.
I implemented it and tried to measure the slowdown,
but it was lost in the noise. Repeatable but
hard-to-understand effects, such as the alignment
of branch targets with respect to instruction cache
boundaries, matter more than the two extra
instructions.
I cross-checked the performance I measured in Larceny
against the performance of Bigloo. Reading only one
character at a time, both systems read between 15 and
25 million characters per second on my test machine,
which is about half the performance of C's getc.
Chez Scheme would probably be faster, but most other
systems would be slower. The reasons for the slower
systems being slower do not have anything to do with
i/o, however. For the most part, the slower systems
are slower because of the quality of their compilers,
or the lack thereof. Since the i/o primitives
of the slower systems are usually written in languages
other than Scheme, it seems likely that the relative
slowdown due to replacing ASCII or Latin-1 by UTF-8
will be even less in the slower systems than in the
systems I benchmarked.
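For the curious, the loops I timed were of roughly
this shape:

    ;; Read a file one character at a time and count the
    ;; characters read.
    (define (count-chars filename)
      (call-with-input-file filename
        (lambda (p)
          (let loop ((n 0))
            (if (eof-object? (read-char p))
                n
                (loop (+ n 1)))))))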
> > * how character decoding could call iconv, with about
> > the same performance as in C, in programs that do
> > not need to follow a get-char with a get-u8
>
> But how is the implementation supposed to know this, except
> when something like get-string-all is called?
What I am suggesting is that programs that want to
use iconv should read the input as binary and then
call iconv without relying on the transcoders and
textual i/o of the draft R6RS.
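As a sketch, assuming the draft's binary i/o and a
hypothetical FFI binding iconv-decode (which is not
in any draft):

    ;; Do the i/o in binary, then hand the bytes to iconv.
    (define (read-file-via-iconv filename encoding)
      (let* ((p (open-file-input-port filename)) ; binary port
             (bytes (get-bytevector-all p)))
        (close-port p)
        (iconv-decode bytes encoding)))          ; bytes -> string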
I don't think it is possible to design an i/o system
that satisfies both of the following requirements:
* arbitrary mixing of binary and textual i/o
* efficient support for all possible transcoders
(Consider, for example, the transcoding that reverses
the order of bits within a file, interprets the result
as the input tape for some specific universal Turing
machine, and produces the sequence of characters that
are output by that machine, for some specific meaning
of "output".)
The i/o system of the draft R6RS isn't intended to
support all possible transcoders. Its intent is to
support efficient mixing of binary and textual i/o
down to the byte level (but no lower) for a small
set of stateless transcoders that don't require much
lookahead.
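For instance, under the draft's rules a program can
read a textual header and then switch to binary
reads on the same port; a sketch:

    ;; Read characters up to a linefeed (or eof), then
    ;; return the header string and the remaining raw bytes.
    (define (read-header-then-bytes p)
      (let loop ((chars '()))
        (let ((c (get-char p)))
          (if (or (eof-object? c) (char=? c #\linefeed))
              (values (list->string (reverse chars)) ; header
                      (get-bytevector-all p))        ; raw bytes
              (loop (cons c chars))))))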
For stateful transcoders, and for transcoders that
require a lot of lookahead, I think the right thing
to do is to compose something like the draft's binary
i/o with something like iconv.
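A sketch of that composition, assuming a custom-port
constructor like the one in later drafts and a pair
of hypothetical iconv bindings (iconv-open and
iconv-read!):

    ;; Wrap a binary port in a textual port whose read!
    ;; procedure pushes raw bytes through a stateful iconv
    ;; conversion descriptor cd.
    (define (open-iconv-input-port binary-port encoding)
      (let ((cd (iconv-open encoding))) ; cd holds the shift state
        (make-custom-textual-input-port
         "iconv"
         (lambda (str start count)      ; fill str, return the
           (iconv-read! cd binary-port str start count)) ; char count
         #f #f                          ; no get/set-position
         (lambda () (close-port binary-port)))))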
> Those are the only encodings that R6RS requires. But presumably
> an implementation can provide others. I don't see anything that
> mentions a restriction on mixing byte/char input if a non-standard
> codec is used, though perhaps I missed it.
The draft R6RS places no restrictions on mixing
binary and textual input. That means there is an
implied restriction on encodings and efficiency:
Implementations can provide any stateless codecs
they want in addition to those listed, but those
implementations must figure out how to mix binary
with textual i/o for all codecs they supply, and
programmers must live with any inefficiencies that
result.
That may well be unreasonable, but it is what the
draft implies.
Will