[R6RS] Source code encoding
Marc Feeley
feeley
Mon Mar 7 15:07:56 EST 2005
> Marc> Something's strange here. First of all there is no need for a BOM in
> Marc> UTF-8 because UTF-8 is a sequence of bytes. [...]
>
> For an explanation, check
>
> http://www.unicode.org/faq/utf_bom.html#BOM
But this reference also says that adding a BOM to UTF-8 is only useful
as a signature, to distinguish it from other encodings such as UTF-32
and Latin-1, and we would not use those encodings. Moreover, have you
read this part?

    Q: Can a UTF-8 data stream contain the BOM character (in UTF-8
       form)? If yes, then can I still assume the remaining UTF-8
       bytes are in big-endian order?

    A: Yes, UTF-8 can contain a BOM. However, it makes no difference as
       to the endianness of the byte stream. UTF-8 always has the same
       byte order. An initial BOM is only used as a signature -- an
       indication that an otherwise unmarked text file is in
       UTF-8. Note that some recipients of UTF-8 encoded data do not
       expect a BOM. Where UTF-8 is used transparently in 8-bit
       environments, the use of a BOM will interfere with any protocol
       or file format that expects specific ASCII characters at the
       beginning, such as the use of "#!" at the beginning of Unix
       shell scripts.

A leading BOM would mean that you can't use a UTF-8 encoded Scheme
source file as a shell script. That would be bad.
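
To make the shell-script problem concrete, here is a small sketch in
Python (not from the original message). It assumes the usual Unix
behaviour of recognizing a script by its first two bytes being "#!";
the interpreter name `scheme-script` is hypothetical.

```python
import codecs

# Hypothetical script text; "scheme-script" is an illustrative name only.
script = '#!/usr/bin/env scheme-script\n(display "hello")\n'

plain = script.encode("utf-8")        # UTF-8 without a BOM
with_bom = codecs.BOM_UTF8 + plain    # UTF-8 with the 3-byte signature EF BB BF

def starts_with_shebang(data: bytes) -> bool:
    """Return True if the first two bytes are the ASCII characters '#!'."""
    return data[:2] == b"#!"

print(starts_with_shebang(plain))     # file without BOM
print(starts_with_shebang(with_bom))  # file with BOM: first bytes are EF BB
```

With the BOM in place the file begins with the bytes EF BB BF, so the
"#!" check on the first two bytes no longer matches.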
I maintain that allowing UTF-16 + BOM and UTF-8 is a good compromise
(it covers the two most popular Unicode file encodings, allows shell
scripts, plain ASCII files need not be changed, and a wide range of
editors can be used). We could, however, add that an initial BOM
on a UTF-8 encoded file is ignored.
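
The compromise above could be sketched roughly as follows. This is an
illustration in Python, not part of any proposal text: the function
name `decode_source` is made up, and it assumes that only UTF-16 with
a BOM and UTF-8 (with or without a BOM) ever occur, so a UTF-32 BOM is
not considered.

```python
import codecs

def decode_source(data: bytes) -> str:
    """Decode source bytes under the assumed rules:
    UTF-16 needs a BOM; everything else is treated as UTF-8,
    and an initial UTF-8 BOM is skipped."""
    if data.startswith(codecs.BOM_UTF16_BE):          # FE FF
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):          # FF FE
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF8):              # EF BB BF, ignored
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: plain ASCII and BOM-less UTF-8 both decode as UTF-8.
    return data.decode("utf-8")
```

Because plain ASCII is a subset of UTF-8, existing ASCII files and
"#!" scripts go through the last branch unchanged.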
Marc