[r6rs-discuss] Stateful codecs and inefficient transcoding

From: William D Clinger <will>
Date: Mon Oct 30 15:00:11 2006

I am posting this as an individual member of the Scheme
community. I am not speaking for the R6RS editors.

Marcin 'Qrczak' Kowalczyk wrote:
> Note that UTF-x codecs with a BOM are stateful because they must
> remember whether they have written (on encoding) or read (on decoding)
> the BOM at the beginning.

With the draft R6RS, the efficient idiom for using
a BOM to select the transcoder to use for buffered
input of the entire contents of a port would run
something like this:

(define (get-string-all-using-BOM port)
  (get-string-all port (guess-the-transcoder port)))

where guess-the-transcoder would be defined something
like this:

(define (guess-the-transcoder port)
  (let* ((zwnbsp #\xfeff)
         (utf-8 (transcoder (codec (utf-8-codec))))
         (utf-16le (transcoder (codec (utf-16le-codec))))
         (utf-16be (transcoder (codec (utf-16be-codec))))
         (utf-32le (transcoder (codec (utf-32le-codec))))
         (utf-32be (transcoder (codec (utf-32be-codec))))
         (c8 (lookahead-char port utf-8))
         (c16le (lookahead-char port utf-16le))
         (c16be (lookahead-char port utf-16be))
         (c32le (lookahead-char port utf-32le))
         (c32be (lookahead-char port utf-32be)))
    (cond ((char=? c8 zwnbsp)
           utf-8)
          ((char=? c16le zwnbsp)
           utf-16le)
          ((char=? c16be zwnbsp)
           utf-16be)
          ((char=? c32le zwnbsp)
           utf-32le)
          ((char=? c32be zwnbsp)
           utf-32be)
          (else
           utf-8))))
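
For comparison, the same BOM test can be sketched at the byte level.
This is only an illustration, assuming the bytevector procedures of
the finalized R6RS libraries rather than the draft API used above.
The names starts-with? and guess-encoding-from-bom are hypothetical,
and the result is a symbol rather than a transcoder. The UTF-32 checks
are placed before the UTF-16 ones because the UTF-32LE BOM
(FF FE 00 00) begins with the UTF-16LE BOM (FF FE).

```scheme
(import (rnrs))

;; Hypothetical helper: does the bytevector bv begin with the given
;; list of byte values?
(define (starts-with? bv bytes)
  (let loop ((i 0) (bytes bytes))
    (cond ((null? bytes) #t)
          ((>= i (bytevector-length bv)) #f)
          ((= (bytevector-u8-ref bv i) (car bytes))
           (loop (+ i 1) (cdr bytes)))
          (else #f))))

;; Hypothetical helper: return a symbol naming the encoding that the
;; leading bytes suggest, defaulting to utf-8 when no BOM is found.
(define (guess-encoding-from-bom bv)
  (cond ((starts-with? bv '(#xEF #xBB #xBF)) 'utf-8)
        ;; UTF-32 first: FF FE 00 00 also starts with FF FE.
        ((starts-with? bv '(#xFF #xFE #x00 #x00)) 'utf-32le)
        ((starts-with? bv '(#x00 #x00 #xFE #xFF)) 'utf-32be)
        ((starts-with? bv '(#xFF #xFE)) 'utf-16le)
        ((starts-with? bv '(#xFE #xFF)) 'utf-16be)
        (else 'utf-8)))
```

One known limitation of this ordering: UTF-16LE text whose first
character after the BOM is U+0000 would be reported as UTF-32LE.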

> Since it's allowed to call get-char, get-line etc. and then get-u8
> or get-bytes-n, with the bytes being those immediately following
> the encoded form of the characters, decoding must be performed one
> character at a time, since the decision when to stop is based on the
> resulting character sequence. This is slow because transcoder dispatch
> and setting up loops cannot be amortized across decoding many
> characters.

With the default transcoder, you can expect the
draft R6RS get-char to be about as fast as current
implementations of read-char.

In other words, you can expect get-char to deliver
about half the performance of C's getc when reading
characters one at a time. On a 2.8 GHz, 32-bit
Pentium from several years ago, that's about 20
million characters per second. (The performance
might be better in unsafe mode, but I haven't
benchmarked that.)

If that isn't fast enough, you can read more than
one character at a time, as illustrated above.
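
A minimal sketch of reading in blocks rather than one character at a
time, assuming get-string-n and open-string-input-port as they appear
in the finalized R6RS (rnrs io ports) library. The helper name
count-chars-in-chunks and the chunk size are hypothetical, and no
performance figures are implied.

```scheme
(import (rnrs))

;; Hypothetical helper: count the characters available on a textual
;; input port by reading up to chunk-size characters per call instead
;; of calling get-char once per character.
(define (count-chars-in-chunks port chunk-size)
  (let loop ((total 0))
    (let ((chunk (get-string-n port chunk-size)))
      (if (eof-object? chunk)
          total
          (loop (+ total (string-length chunk)))))))
```

The per-call overhead is then paid once per chunk, not once per
character.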

> It would be better to make text streams on top of binary streams,
> and to put buffering once on the top instead of embedding a buffering
> layer in each port. All intermediate conversions should be performed
> a block at a time; the buffering layer switches to individual
> characters.

Couldn't an R6RS program do that by using get-bytes-all
(or get-bytes-n, get-bytes-n!, or get-bytes-some) to
read the binary data and then applying any binary-to-text
conversion it wants?
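
The read-then-convert approach might look roughly like this. It is a
sketch only, written against the finalized R6RS libraries, where the
corresponding procedure is spelled get-bytevector-all, and it assumes
the data is UTF-8. The helper name get-string-all-via-bytes is
hypothetical.

```scheme
(import (rnrs))

;; Hypothetical helper: read every remaining byte from a binary input
;; port, then decode the whole block as UTF-8 in a single step.
(define (get-string-all-via-bytes binary-port)
  (let ((bytes (get-bytevector-all binary-port)))
    (if (eof-object? bytes)
        ""                        ; nothing left to read
        (utf8->string bytes))))

;; Usage with an in-memory binary port.
(define example-port
  (open-bytevector-input-port (string->utf8 "hello, world")))
(get-string-all-via-bytes example-port)
```

A program could substitute any other binary-to-text conversion for
utf8->string, since the bytes are in hand before decoding begins.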

I too have some concerns about the draft R6RS i/o
system, but I don't yet understand why I should be
concerned about the efficiency of stateful codecs
or character-at-a-time input.

Will
Received on Mon Oct 30 2006 - 14:59:14 UTC
