[R6RS] BOM-based codecs
William D Clinger
will at ccs.neu.edu
Wed Aug 16 11:30:14 EDT 2006
Mike wrote:
> I suggest adding a codec returned by a nullary procedure
>
> utf-bom-codec
>
> that will return a codec for a meta-encoding based on the Unicode
> byte-order mark. This codec will only work for input ports, and raise
> an exception if used for an output port.
>
> For an input port with a transcoder with such a codec, the first
> attempt to read from the port will read 2, 3 or 4 bytes from the port
> that determine the actual encoding according to the following table:
>
> EF BB BF       UTF-8
> FE FF          UTF-16be
> FF FE          UTF-16le
> 00 00 FE FF    UTF-32be
> FF FE 00 00    UTF-32le
>
> Will, is that what you had in mind as far as the BOM is concerned?
No. What I had in mind were codecs for UTF-16 and UTF-32,
as summarized at http://www.unicode.org/faq/utf_bom.html#37 .
What you are proposing is a non-standard but possibly useful
heuristic. (I have reservations about requiring the UTF-8
heuristic, but just because I haven't seen it before.)
The utf-bom-codec you are proposing would not do away with
the need for the utf-16-codec and utf-32-codec that I have
been advocating. Those codecs differ from the utf-bom-codec
you are proposing in that they:
* are Unicode standards
* default in the standard way when there is no byte order mark
* never default to an unexpected width
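
To make the contrast concrete, here is a minimal sketch in Python
(not Scheme, and not any proposed R6RS API). The names sniff_bom and
decode_utf16 are hypothetical, chosen only for illustration. The first
function applies the quoted BOM table as a heuristic; the second
behaves like a UTF-16 codec, which honors a UTF-16 byte order mark and
otherwise assumes big-endian, as the Unicode Standard specifies for
the UTF-16 encoding scheme.

```python
# Hypothetical helper names; a sketch under stated assumptions only.

# The quoted table, longest marks first so that FF FE 00 00 is tried
# before its two-byte prefix FF FE.
BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def sniff_bom(data):
    """Heuristic: return (encoding, bom_length), or (None, 0) if no BOM."""
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_utf16(data):
    """UTF-16 codec behavior: honor a BOM, else assume big-endian."""
    if data.startswith(b"\xff\xfe"):
        return data[2:].decode("utf-16-le")
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    # No byte order mark: big-endian is the standard default.
    return data.decode("utf-16-be")
```

Given the bytes FF FE 00 00, sniff_bom reports UTF-32le, whereas
decode_utf16 treats them as a little-endian BOM followed by U+0000.
That is the sense in which a dedicated UTF-16 codec never switches to
an unexpected width.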
Will