[r6rs-discuss] Stateful codecs and inefficient transcoding from John Cowan on 2006-10-31 (r6rs-discuss.mbox)

From: John Cowan <cowan>
Date: Tue Oct 31 02:02:13 2006

William D Clinger scripsit:

> Although the draft R6RS does not have your hypothetical utf-16-codec
> that relies on an initial BOM to select the endianness,

I will be filing a formal comment objecting to this. The standard
encoding of Unicode files (that is, files which may contain any Unicode
character) on Windows systems is UTF-16; neither UTF-16LE nor UTF-16BE
is customarily used there. In addition, UTF-16 is one of the two
encodings (the other being UTF-8) which all XML processors are required
to understand.

> the implementation could peek at the first two bytes, decide whether
> to use UTF-16BE or UTF-16LE, and could install one of those two as
> the transcoder associated with the port.

That's not quite right. The implementation must peek
at the first two bytes, and if:

1) they are FE FF, they must be consumed and UTF-16BE installed;

2) they are FF FE, they must be consumed and UTF-16LE installed;

3) otherwise, the environment must be interrogated to see if
   UTF-16 is by default in little-endian order, and if so,
   UTF-16LE must be installed without consuming the two bytes;

4) otherwise, UTF-16BE must be installed without consuming the
   two bytes.

The point here is that neither the UTF-16LE nor the UTF-16BE encoding
is permitted to use a BOM; if a U+FEFF character appears in one of them,
it is the substantive character ZERO WIDTH NO-BREAK SPACE (ZWNBSP), even
at the very start. In the UTF-16 encoding, U+FEFF is a BOM at the
beginning of a file but a ZWNBSP elsewhere.
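
The four cases above can be sketched in Python as follows. This is only
an illustration, not anything the draft R6RS specifies: the function
names are hypothetical, and using sys.byteorder as the answer to
"interrogating the environment" is an assumption made for the example.

```python
import sys

BOM_BE = b"\xfe\xff"  # U+FEFF serialized big-endian
BOM_LE = b"\xff\xfe"  # U+FEFF serialized little-endian

def select_utf16_codec(data, little_endian_default=None):
    """Return (codec_name, remaining_bytes) for a UTF-16 byte stream."""
    if little_endian_default is None:
        # Assumption: native byte order stands in for the environment default.
        little_endian_default = (sys.byteorder == "little")
    head = data[:2]
    if head == BOM_BE:
        return "utf-16-be", data[2:]   # case 1: consume the BOM
    if head == BOM_LE:
        return "utf-16-le", data[2:]   # case 2: consume the BOM
    if little_endian_default:
        return "utf-16-le", data       # case 3: nothing consumed
    return "utf-16-be", data           # case 4: nothing consumed

def decode_utf16(data, little_endian_default=None):
    """Decode after choosing the byte order once, at the start of the stream."""
    codec, payload = select_utf16_codec(data, little_endian_default)
    # The BOM-less codecs keep any later U+FEFF as an ordinary character.
    return payload.decode(codec)
```

Because the byte order is fixed after the first two bytes, a later FE FF
pair decodes to U+FEFF as a character rather than being dropped.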

> I don't like it much myself, but not for the two reasons you gave.
> For one thing, I harbor a prejudice against stateful encodings;

Be that as it may, they are prominent in both pre-Unicode and Unicode
systems.

> Furthermore I am told that some important file formats,
> e.g. XML, use several different textual encodings.

You are told wrongly.

XML *documents* may comprise multiple files (external entities), but
each file is in one and only one encoding, indicated thus:

1) Files in UTF-16 MUST begin with a BOM and MAY follow the BOM
   with an internal encoding declaration.

2) Files in UTF-8 MAY begin with a BOM which MAY be followed by an
   internal encoding declaration.

3) Files in other Unicode encodings MAY begin with a BOM and MUST
   include an internal encoding declaration (after the BOM, if any).

4) Files in non-Unicode encodings MUST begin with an internal
   encoding declaration.

Note that because ASCII encoding is a subset of UTF-8 encoding, ASCII
files do not require an internal encoding declaration.

In order to read the internal encoding declaration, XML processors must
read each file at the byte level. They then have a choice between
switching to character reading or restarting at the beginning with
character reading. An XML file MUST NOT contain bytes that are not
permitted in the character encoding of the file.
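
As a rough sketch of the "restart at the beginning" option, the Python
below sniffs the encoding at the byte level and then decodes the whole
file with it. The function names are hypothetical, and the sniffing is
deliberately simplified: it assumes that a file with no BOM is in an
ASCII-compatible encoding, so it does not handle EBCDIC, UTF-32, or
BOM-less UTF-16.

```python
import re

# Encoding declaration inside the XML declaration, matched on raw bytes.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def sniff_xml_encoding(data):
    """Guess a Python codec name for one XML file from its leading bytes."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"   # UTF-8 with BOM; this codec strips the BOM
    if data.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16"      # this codec reads and consumes the BOM
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"           # no BOM, no declaration: UTF-8 (covers ASCII)

def read_xml_text(data):
    """Restart from the first byte and decode the file as characters."""
    return data.decode(sniff_xml_encoding(data))
```

A decoding error raised by read_xml_text corresponds to the last rule
above: the file contains bytes that its encoding does not permit.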

-- 
Well, I have news for our current leaders       John Cowan
and the leaders of tomorrow: the Bill of        cowan_at_ccil.org
Rights is not a frivolous luxury, in force      http://www.ccil.org/~cowan
only during times of peace and prosperity.
We don't just push it to the side when the going gets tough.  --Molly Ivins

Received on Tue Oct 31 2006 - 02:02:04 UTC