[R6RS] BOM-based codecs

Wed Aug 16 11:30:14 EDT 2006

Mike wrote:
> I suggest adding a codec returned by a nullary procedure
> 
> utf-bom-codec
> 
> that will return a codec for a meta-encoding based on the Unicode
> byte-order mark.  This codec will only work for input ports, and raise
> an exception if used for an output port.
> 
> For an input port with a transcoder with such a codec, the first
> attempt to read from the port will read 2, 3 or 4 bytes from the port
> that determine the actual encoding according to the following table:
> 
> EF BB BF    UTF-8
> FE FF       UTF-16be
> FF FE       UTF-16le
> 00 00 FE FF UTF-32be
> FF FE 00 00 UTF-32le
> 
> Will, is that what you had in mind as far as the BOM is concerned?

No.  What I had in mind were codecs for UTF-16 and UTF-32,
as summarized at http://www.unicode.org/faq/utf_bom.html#37 .
What you are proposing is a non-standard but possibly useful
heuristic.  (I have reservations about requiring the UTF-8
heuristic, but just because I haven't seen it before.)
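
For concreteness, the table-driven heuristic quoted above might be sketched as follows. This is only an illustration in Python, not proposed R6RS text: the name sniff_bom, the table layout, and the return shape are all assumptions made for the example. The 4-byte marks have to be tested before the 2-byte ones, because FF FE 00 00 begins with FF FE.

```python
# Hypothetical sketch of the BOM table lookup; sniff_bom and BOM_TABLE
# are illustrative names, not part of any proposal.
BOM_TABLE = [
    # Longer marks first: FF FE 00 00 would otherwise match FF FE.
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no mark matches."""
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

Note that the bytes FF FE 00 00 are also a valid UTF-16le stream (a byte order mark followed by U+0000), so the sketch has to pick one reading; here it picks UTF-32le.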

The utf-bom-codec you are proposing would not do away with
the need for the utf-16-codec and utf-32-codec that I have
been advocating.  Those codecs differ from the utf-bom-codec
you are proposing in that they:

 * are Unicode standards
 * default in the standard way when there is no byte order mark
 * never default to an unexpected width
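
As a rough illustration of the last two points, a UTF-16 decoder along these lines could look like the Python sketch below. The function name decode_utf16 is hypothetical, and the big-endian fallback is an assumption based on what the Unicode Standard specifies for the UTF-16 encoding scheme when there is no byte order mark and no higher-level protocol.

```python
# Hypothetical sketch of a UTF-16 codec's decoding side; decode_utf16 is
# an illustrative name. The width is fixed at 16 bits; only the byte
# order depends on the input.
def decode_utf16(data: bytes) -> str:
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")   # BOM says big-endian
    if data.startswith(b"\xff\xfe"):
        return data[2:].decode("utf-16-le")   # BOM says little-endian
    # No BOM: assume big-endian (the assumed standard default), rather
    # than Python's "utf-16" codec, which falls back to native order.
    return data.decode("utf-16-be")
```

Given the bytes FF FE 00 00, this sketch decodes a little-endian U+0000 and never switches to a 32-bit reading, which is the sense of "unexpected width" above.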

Will