[r6rs-discuss] [Formal] R6RS must support UTF-16 encoding

From: John Cowan <cowan>
Date: Wed Nov 1 13:40:39 2006

---
This message is a formal comment which was submitted to formal-comment_at_r6rs.org, following the requirements described at: http://www.r6rs.org/process.html
---
Submitter: John Cowan
Email address: cowan_at_ccil.org
Issue type: Defect
Priority: Major
Component: I/O
Report version: 5.91
Summary: R6RS must provide a UTF-16 codec, because UTF-16 is an
essential encoding.

R6RS implementations are currently required to support the UTF-8, Latin-1
(ISO 8859-1), UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings.
This list omits the essential UTF-16 encoding.

The difference between UTF-16 and UTF-16{BE,LE} is that in the former,
the presence of a BOM (byte order mark, U+FEFF) at the beginning of the
input stream indicates the byte order of the 16-bit code units that make
up each character.
The BOM is not considered part of the content.  (If no BOM is present,
the environment's default ordering is used; failing that, big-endian
order is used.)

In the UTF-16BE and UTF-16LE encodings, no BOM is permitted; an
initial U+FEFF character has its alternative semantics of zero-width
no-break space.  These encodings are far less commonly used than the
UTF-16 encoding.
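
To make the distinction concrete, here is a minimal sketch in Python (used
only for illustration; it is not part of the R6RS draft). The function name
decode_utf16 and its default parameter are hypothetical, and the fallback
is simplified to a single caller-supplied ordering that defaults to
big-endian.

```python
import codecs


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16: a leading BOM selects the byte order and is dropped."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to the caller's default ordering (big-endian here).
    return data.decode(default)


# A little-endian document that starts with a BOM.
with_bom = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

as_utf16 = decode_utf16(with_bom)          # BOM is consumed, not content
as_utf16le = with_bom.decode("utf-16-le")  # U+FEFF stays in the content
```

Under these assumptions as_utf16 holds only the two letters, while
as_utf16le begins with a U+FEFF character that the program would have to
strip by hand.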
In particular, the Windows operating system consistently creates UTF-16
documents in little-endian order (not UTF-16LE documents) whenever
characters must be written that are not available in the locale-dependent
encoding.  In essence, Windows systems provide two different encodings at
any one time: the "ANSI" (locale-dependent, 8-bit or 8/16-bit) encoding,
and the UTF-16 encoding.  (The MS-DOS compatibility support provides a
third encoding for use by MS-DOS programs.)  Failing to provide a UTF-16
codec will make it unnecessarily hard to process Unicode documents
generated by Windows.
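
As a rough illustration of what such a document looks like on disk, the
following Python sketch builds the byte sequence by hand. The helper name
encode_utf16_with_bom is hypothetical; it models only "BOM followed by
little-endian code units", not any particular Windows API.

```python
import codecs


def encode_utf16_with_bom(text: str) -> bytes:
    """Encode text as UTF-16 in little-endian order with a leading BOM."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


# "h" followed by U+00E9 (e with acute accent).
doc = encode_utf16_with_bom("h\u00e9")

# A BOM-aware UTF-16 decoder recovers the text without the BOM.
round_trip = doc.decode("utf-16")
```

The first two bytes of doc are FF FE; a reader that only offers UTF-16LE
would hand them to the program as a spurious leading character.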
In addition, UTF-16 (not UTF-16LE or UTF-16BE) is one of the two
encodings which all XML processors (parsers) are required to accept,
the other being UTF-8.  Depending on the predominant language of the
document, UTF-16 encoding may be more or less compact than UTF-8 encoding.
Failing to provide a UTF-16 codec will make a substantial range of XML
documents difficult to process.
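
The compactness point can be checked with a small Python sketch. The
helper name encoded_sizes and the two sample strings are my own choices
for illustration; the UTF-16 count excludes the two-byte BOM.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count excluding any BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


# ASCII text: one byte per character in UTF-8, two in UTF-16.
ascii_sizes = encoded_sizes("hello")

# Three CJK characters (U+65E5 U+672C U+8A9E): three bytes each in UTF-8,
# two bytes each in UTF-16.
cjk_sizes = encoded_sizes("\u65e5\u672c\u8a9e")
```

Here ascii_sizes is (5, 10) and cjk_sizes is (9, 6), so neither encoding
is uniformly the smaller one.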
I propose that a procedure named "utf-16-codec" be added to section
15.3.3 (p. 86).  I further propose that the codecs for the rarely used
UTF-{16,32}{BE,LE} encodings be removed.  No form of UTF-32 encoding is
in common use in I/O, though UTF-32 format is sometimes convenient for
internal use.
-- 
John Cowan     http://ccil.org/~cowan    cowan_at_ccil.org
Monday we watch-a Firefly's house, but he no come out.  He wasn't home.
Tuesday we go to the ball game, but he fool us.  He no show up.  Wednesday he
go to the ball game, and we fool him.  We no show up.  Thursday was a
double-header.  Nobody show up.  Friday it rained all day.  There was no ball
game, so we stayed home and we listened to it on-a the radio.  --Chicolini
Received on Tue Oct 31 2006 - 19:43:57 UTC
