[r6rs-discuss] Stateful codecs and inefficient transcoding

From: Per Bothner <per>
Date: Tue Oct 31 01:26:19 2006

William D Clinger wrote:
> By "default transcoder", I meant the transcoder that
> an implementation uses for procedures like get-char
> when no explicit transcoder argument is given.
> According to draft R6RS section 15.3.5, paragraph 1,
> the default transcoder *must* be "UTF-8 with a
> platform-specific end-of-line convention."

Sorry - I wasn't sure what you meant by "default transcoder".
In other environments the "default transcoder" is the one the
system picks for you *depending on your implicit locale*.

> If the default transcoder were for Latin-1 with the lf
> eol-style, then the implementation of get-char would
> be *exactly* the same as the implementation of read-char
> in the current development version of Larceny.

While I can't speak about Larceny, I don't think that is quite true
in general, since you do need an extra test or indirection
that you wouldn't otherwise need to handle the possibility
of a non-default transcoder. But we can agree one can make
this trivially small, for example by only doing the extra
test or indirection when re-filling a large buffer.
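
To make the "only when re-filling a large buffer" point concrete, here
is a rough sketch in Python (chosen only for illustration; this is not
how Larceny or any R6RS implementation does it). The class TextPort and
its method names are hypothetical. The per-character path never looks
at the transcoder; the decoder object is consulted only in _refill.
The sketch ignores mixing with byte-level reads.

```python
import codecs
import io


class TextPort:
    """Hypothetical buffered textual input port (illustration only)."""

    def __init__(self, stream, encoding="latin-1", bufsize=4096):
        self._stream = stream
        # The one indirection: a decoder object chosen when the port is opened.
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self._bufsize = bufsize
        self._chars = ""  # decoded characters of the current buffer
        self._pos = 0     # index of the next character to return

    def get_char(self):
        # Fast path: identical for every transcoder.
        if self._pos >= len(self._chars) and not self._refill():
            return None  # end of file
        ch = self._chars[self._pos]
        self._pos += 1
        return ch

    def _refill(self):
        # Slow path: the only place the transcoder is consulted.
        while True:
            data = self._stream.read(self._bufsize)
            self._chars = self._decoder.decode(data, final=not data)
            self._pos = 0
            # Stop when we have characters, or when the stream is exhausted.
            if self._chars or not data:
                return bool(self._chars)


def read_all(port):
    """Drain a port one character at a time."""
    return "".join(iter(port.get_char, None))
```

With a buffer of a few kilobytes, the cost of the call through the
decoder is spread over thousands of get_char calls.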

>>> * how character decoding could call iconv, with about
>>> the same performance as in C, in programs that do
>>> not need to follow a get-char with a get-u8
>> But how is the implementation supposed to know this, except
>> when something like get-string-all is called?
>
> What I am suggesting is that programs that want to
> use iconv should read the input as binary and then
> call iconv without relying on the transcoders and
> textual i/o of the draft R6RS.

But the issue isn't "programs that want to use iconv".
It's somebody wanting to write "hello world" in an
environment where their files use a different encoding
than UTF-8. Perhaps the world is moving to UTF-8, but there
are still a lot of legacy systems.

Having the default transcoder always be UTF-8, rather
than depending on the current locale, is I think wrong.
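
A small Python illustration of the problem (Python only for
convenience; the function name default_encoding_from_locale is made up
for this sketch). Text stored by a Latin-1 system is not valid UTF-8,
so a fixed UTF-8 default cannot read it, while a locale-derived default
on that system could.

```python
import locale


def default_encoding_from_locale():
    # Encoding implied by the user's locale settings; the value
    # depends on the machine this runs on.
    return locale.getpreferredencoding(False)


text = "na\u00efve caf\u00e9"          # "naive cafe" with accented i and e
legacy_bytes = text.encode("latin-1")  # what a Latin-1 system stores on disk

# A fixed UTF-8 default fails on these bytes.
try:
    legacy_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the file was written in recovers the text.
recovered = legacy_bytes.decode("latin-1")
```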

> I don't think it is possible to design an i/o system
> that satisfies both of the following requirements:
>
> * arbitrary mixing of binary and textual i/o
> * efficient support for all possible transcoders

I agree. But I think the second is at least as important
as the first.

> The i/o system of the draft R6RS isn't intended to
> support all possible transcoders. Its intent is to
> support efficient mixing of binary and textual i/o
> down to the byte level (but no lower) for a small
> set of stateless transcoders that don't require much
> lookahead.
>
> For stateful transcoders, and for transcoders that
> require a lot of lookahead, I think the right thing
> to do is to compose something like the draft's binary
> i/o with something like iconv.

That's ok, as long as you're willing to write off all
casual use of R6RS on systems where UTF-8 is not the
default encoding for files.
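
For readers unsure what "stateful" means here, a short Python sketch
(illustration only, using Python's built-in iso2022_jp codec as an
example of a stateful encoding). The same two payload bytes decode
differently depending on whether an escape sequence came before them,
which is why a byte-level read in the middle of such a stream is hard
to define.

```python
text = "\u3053"  # HIRAGANA LETTER KO
data = text.encode("iso2022_jp")

# The encoded form is: escape into JIS X 0208, two payload bytes,
# escape back to ASCII.
escape, payload = data[:3], data[3:5]

# Decoded from the initial (ASCII) state, the payload is two ASCII characters.
in_ascii_state = payload.decode("iso2022_jp")

# Decoded after the escape sequence, the same bytes are one kana character.
in_kanji_state = data.decode("iso2022_jp")
```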

Perhaps there aren't very many such systems, and this is
a concern that is dwindling. But I'm curious: How many people
on this list run Scheme on systems where the default character
encoding is not UTF-8 (or ASCII)? Would you find it acceptable
that (open-output-file "foo.txt") creates a file you can't read
without understanding character encodings?

>> Those are the only encodings that R6RS requires. But presumably
>> an implementation can provide others. I don't see anything that
>> mentions a restriction on mixing byte/char input if a non-standard
>> codec is used, though perhaps I missed it.
>
> The draft R6RS places no restrictions on mixing
> binary and textual input. That means there is an
> implied restriction on encodings and efficiency:
> Implementations can provide any stateless codecs
> they want in addition to those listed, but those
> implementations must figure out how to mix binary
> with textual i/o for all codecs they supply, and
> programmers must live with any inefficiencies that
> result.
>
> That may well be unreasonable, but it is what the
> draft implies.

I agree that is what the draft implies. I think it needs to be
changed.
-- 
	--Per Bothner
per_at_bothner.com   http://per.bothner.com/
Received on Tue Oct 31 2006 - 01:26:46 UTC
