[r6rs-discuss] Stateful codecs and inefficient transcoding from William D Clinger on 2006-10-30 (r6rs-discuss.mbox)

From: William D Clinger <will>
Date: Mon Oct 30 16:24:16 2006

I am posting this as an individual member of the Scheme
community. I am not speaking for the R6RS editors.

Per Bothner quoting me:
> > In other words, you can expect get-char to deliver
> > about half the performance of C's getc when reading
> > characters one at a time.
>
> In many cases, sure. But consider a call to get-char. It cannot
> "transcode-ahead", since the next call could be a get-u8.

The claim you quoted was qualified: "With the default
transcoder". I measured the performance without any
"transcode-ahead", so the performance I reported is
what you should expect even if the next call could be
to get-u8.

> More complex table-driven decoding would be ridiculous
> to do in Scheme. Not a priori, but because it makes much more
> sense to use existing libraries, such as iconv.

Agreed.

> So you really have to explain how you would implement
> character decoding using iconv while still being only
> twice as slow as C, and allowing a get-char to be followed
> by a get-u8.

In my previous message, I explained:

    * how character-at-a-time decoding with the default
      transcoder would be only twice as slow as C's getc,
      while allowing any get-char to be followed by a
      get-u8

    * how character decoding could call iconv, with about
      the same performance as in C, in programs that do
      not need to follow a get-char with a get-u8
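
The first point above can be illustrated with a minimal sketch. It is not the implementation that was measured; it assumes UTF-8 input held in memory, and the `BytePort` class and its method names are hypothetical. Because `get_char` consumes only the bytes of the character it returns, with no transcode-ahead, a later `get_u8` sees exactly the next unread byte.

```python
class BytePort:
    """Hypothetical input port over an in-memory byte string."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # index of the next unread byte

    def get_u8(self):
        """Return the next byte as an int, or None at end of input."""
        if self.pos >= len(self.data):
            return None
        b = self.data[self.pos]
        self.pos += 1
        return b

    def get_char(self):
        """Decode exactly one UTF-8 character, consuming only its bytes."""
        if self.pos >= len(self.data):
            return None
        lead = self.data[self.pos]
        # The lead byte alone determines the length of the character.
        if lead < 0x80:
            n = 1
        elif lead >> 5 == 0b110:
            n = 2
        elif lead >> 4 == 0b1110:
            n = 3
        elif lead >> 3 == 0b11110:
            n = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        chunk = self.data[self.pos:self.pos + n]
        ch = chunk.decode("utf-8")  # raises UnicodeDecodeError if malformed
        self.pos += n  # advance past this character only, never further
        return ch


# Usage: two characters followed by one raw byte.
port = BytePort("\u00e9!".encode("utf-8") + b"\xff")
first = port.get_char()   # two-byte character
second = port.get_char()  # one-byte character
raw = port.get_u8()       # the byte right after the second character
```

The second point corresponds to handing a whole buffer to a library decoder at once, which is faster per character but leaves the byte position of the port ahead of the last character delivered.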

> Recommendation: allow get-char/read/... after get-u8/get-bytes-n/...
> but do not require an implementation to support the converse
> (reading bytes after reading characters).

I do not understand how this restriction would improve
performance beyond what we can already expect with the
draft R6RS semantics.

> Or only require support
> for reading bytes after reading characters for a few simple
> standard encodings - primarily UTF8.

That is exactly what the draft R6RS does. The complete
list of encodings for which the draft R6RS would require
support for reading bytes after characters is:

    Latin-1
    UTF-8
    UTF-16LE
    UTF-16BE
    UTF-32LE
    UTF-32BE
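
The message does not say why these six encodings were chosen. One property they share is that each is stateless and the byte length of a character is known from its first code unit. The sketch below illustrates that property; the function name `char_width` and the lowercase encoding labels are assumptions made for illustration, and validation of continuation bytes and low surrogates is left out.

```python
def char_width(encoding: str, data: bytes, pos: int = 0) -> int:
    """Return how many bytes the character starting at data[pos] occupies.

    Only the first code unit of that character is inspected.
    """
    if encoding == "latin-1":
        return 1
    if encoding == "utf-8":
        lead = data[pos]
        if lead < 0x80:
            return 1
        if lead >> 5 == 0b110:
            return 2
        if lead >> 4 == 0b1110:
            return 3
        if lead >> 3 == 0b11110:
            return 4
        raise ValueError("invalid UTF-8 lead byte")
    if encoding in ("utf-16le", "utf-16be"):
        order = "little" if encoding.endswith("le") else "big"
        unit = int.from_bytes(data[pos:pos + 2], order)
        # A high surrogate (0xD800..0xDBFF) announces a second 16-bit unit.
        return 4 if 0xD800 <= unit <= 0xDBFF else 2
    if encoding in ("utf-32le", "utf-32be"):
        return 4
    raise ValueError("unsupported encoding: " + encoding)
```

With a width function like this, a port can consume exactly one character and leave the following byte unread, which is what reading bytes after characters requires.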

Will
Received on Mon Oct 30 2006 - 16:24:08 UTC
