[r6rs-discuss] Stateful codecs and inefficient transcoding

From: Per Bothner <per>
Date: Mon Oct 30 17:31:29 2006

William D Clinger wrote:
>> So you really have to explain how you would implement
>> character decoding using iconv while still being only
>> twice as slow as C, and allowing a get-char to be followed
>> by a get-u8.
>
> In my previous message, I explained:
>
> * how character-at-a-time decoding with the default
> transcoder would be only twice as slow as C's getc,
> while allowing any get-char to be followed by a
> get-u8

No, you *stated* that. If there was any supporting explanation
or evidence, I can't see it. I'm assuming you mean this message:
http://lists.r6rs.org/pipermail/r6rs-discuss/2006-October/000509.html
Are you assuming the default transcoder is one of the standard codecs
and so would not need to use an external library like iconv?
Or are you referring to implementation experience?

> * how character decoding could call iconv, with about
> the same performance as in C, in programs that do
> not need to follow a get-char with a get-u8

But how is the implementation supposed to know this, except
when something like get-string-all is called?

>> Recommendation: allow get-char/read/... after get-u8/get-bytes-n/...
>> but do not require an implementation to support the converse
>> (reading bytes after reading characters).
>
> I do not understand how this restriction would improve
> performance beyond what we can already expect with the
> draft R6RS semantics.

Because otherwise it seems very difficult to efficiently
use an existing library like iconv. Calling iconv for
each character seems unacceptable, but otherwise synchronizing
between a char stream and the underlying byte stream seems
impractical. The best I can think of is optimistic decoding:
On get-char decode a whole buffer. If we then see a get-u8,
go back to the position where we last called iconv, and then
call iconv with an output buffer containing only the
previously-converted-and-read characters. That way iconv
will hopefully run out of space at the point corresponding
to the current position, and the input byte position lets
us synchronize to the next byte.
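
The optimistic scheme described above can be sketched roughly as follows.
This is an illustration under stated assumptions, not a real port
implementation: it is written in Python rather than C, and it uses the
standard-library incremental decoder as a stand-in for iconv. Python's
decoders have no output-size limit, so where iconv would stop on a full
output buffer, this sketch re-feeds the input one byte at a time until the
already-read number of characters has been produced. The names
OptimisticReader, get_char, get_u8 and bufsize are hypothetical, and the
whole input is assumed to be in memory.

```python
import codecs


class OptimisticReader:
    """Hypothetical input port mixing character reads and byte reads."""

    def __init__(self, data, encoding="utf-8", bufsize=4096):
        self.data = data          # whole input held in memory (assumption)
        self.encoding = encoding
        self.bufsize = bufsize    # assumed >= the longest encoded character
        self.start = 0            # byte offset where the last bulk decode began
        self.end = 0              # byte offset just past what that decode consumed
        self.chars = ""           # characters produced by the last bulk decode
        self.used = 0             # how many of them get_char has handed out

    def _new_decoder(self):
        # "replace" keeps a bulk decode from failing on binary data that
        # follows the text; such bytes are never handed out if the caller
        # switches to get_u8 before reaching them.
        return codecs.getincrementaldecoder(self.encoding)(errors="replace")

    def get_char(self):
        if self.used == len(self.chars):
            # Look-ahead exhausted: optimistically decode a whole chunk.
            self.start = self.end
            chunk = self.data[self.start:self.start + self.bufsize]
            dec = self._new_decoder()
            self.chars = dec.decode(chunk)
            pending = dec.getstate()[0]   # tail of a character split by the chunk
            self.end = self.start + len(chunk) - len(pending)
            self.used = 0
            if not self.chars:
                return None               # end of input or truncated character
        ch = self.chars[self.used]
        self.used += 1
        return ch

    def _byte_position(self):
        # Go back to where the last bulk decode began and decode again,
        # stopping once the characters already read have been reproduced.
        dec = self._new_decoder()
        pos, produced = self.start, 0
        while produced < self.used and pos < self.end:
            produced += len(dec.decode(self.data[pos:pos + 1]))
            pos += 1
        return pos

    def get_u8(self):
        pos = self._byte_position()
        if pos >= len(self.data):
            return None                   # end of input
        # Discard the unread look-ahead; the next get_char decodes afresh.
        self.start = self.end = pos + 1
        self.chars, self.used = "", 0
        return self.data[pos]


if __name__ == "__main__":
    port = OptimisticReader("a\u00e9\u20ac".encode("utf-8") + bytes([1, 2]) + b"z")
    print([port.get_char() for _ in range(3)], port.get_u8(), port.get_u8())
```

Every get_u8 that follows a get_char pays for a second decode of the bytes
behind the characters already read, and the resynchronization only works
for encodings where restarting the decoder at the saved byte offset is
valid.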

Not exactly elegant: We may have to remember more state;
we may have to decode a buffer multiple times (quadratic
behavior in the worst case); plus any encodings that are
stateful or use fractional bytes (such as hexBinary) may
not work.

>> Or only require support
>> for reading bytes after reading characters for a few simple
>> standard encodings - primarily UTF8.
>
> That is exactly what the draft R6RS does. The complete
> list of encodings for which the draft R6RS would require
> support for reading bytes after characters is:
>
> Latin-1
> UTF-8
> UTF-16LE
> UTF-16BE
> UTF-32LE
> UTF-32BE

Those are the only encodings that R6RS requires. But presumably
an implementation can provide others. I don't see anything that
mentions a restriction on mixing byte/char input if a non-standard
codec is used, though perhaps I missed it.
-- 
	--Per Bothner
per_at_bothner.com   http://per.bothner.com/
Received on Mon Oct 30 2006 - 17:31:56 UTC
