[R6RS] Source code encoding
Marc Feeley
Tue Mar 15 08:33:19 EST 2005
I understand your arguments, but don't agree with your conclusions.
> Marc> Now you are advocating for using UTF-8 only. Why not allow UTF-16 +
> Marc> BOM also, since it does not conflict in any way with UTF-8 and UTF-16
> Marc> + BOM is the norm on Windows for encoding Unicode text files? What is
> Marc> the downside of supporting both of these popular Unicode encodings?
>
> - Because there are standard decoders out there where you can say
> "UTF-xx + BOM" where the auto-detection wouldn't work in the setup
> you describe.
Which decoders are you referring to? Are these really widespread? I
suspect it might be easiest to ask the developers of these decoders to
also have a mode to autodetect between UTF-8 and UTF-16 + BOM, since
that is possible and reasonable in itself.
> - Because, if we allow two different concrete encodings now, we might
> want to add a third one in the future, and it's not clear that
> leaving out the BOM on one of them where it's actually allowed will
> scale.
But an important reason for leaving out the BOM for UTF-8 is to allow
shell scripts. This alone precludes using BOMs with UTF-8, so we have
to give up on autodetection between all possible Unicode encodings.
> - Because this auto-detection based on a tag that isn't there always
> makes me feel queasy, and doesn't seem very robust.
I don't understand. It is possible to distinguish (with no ambiguities)
the following encodings:
- UTF-16 + BOM
- UTF-8 + BOM
- UTF-8
So why do you say it is not robust?
> - Because the perceived (by me) complexity.
You mean implementation complexity? How about:
  (define (determine-encoding port)
    (case (peek-byte port)
      ((#xFE #xFF) 'UTF-16+BOM)
      ((#xEF) 'UTF-8+BOM)
      (else 'UTF-8)))
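If looking at a single byte seems too thin, the same test can compare the whole BOM. What follows is only a rough sketch under some assumptions: the leading bytes of the file are taken to be already available as a list (the parameter `prefix` and both procedure names are made up for illustration, and how the bytes are read is left to the implementation), and UTF-16 is split by byte order. It relies on the fact that the bytes FE and FF never occur in well-formed UTF-8, and that EF BB BF is the UTF-8 encoding of U+FEFF.

```scheme
;; Hypothetical helper: does the byte list PREFIX begin with the bytes in BOM?
(define (starts-with? prefix bom)
  (cond ((null? bom) #t)                      ; every BOM byte matched
        ((null? prefix) #f)                   ; ran out of input first
        ((= (car prefix) (car bom))
         (starts-with? (cdr prefix) (cdr bom)))
        (else #f)))

;; Hypothetical variant of determine-encoding that checks the full BOM.
;; Assumption: PREFIX is a list of the first few bytes of the file.
(define (determine-encoding/prefix prefix)
  (cond ((starts-with? prefix '(#xFE #xFF)) 'UTF-16BE+BOM)
        ((starts-with? prefix '(#xFF #xFE)) 'UTF-16LE+BOM)
        ((starts-with? prefix '(#xEF #xBB #xBF)) 'UTF-8+BOM)
        (else 'UTF-8)))                       ; no BOM: plain UTF-8
```

For instance, a shell script starting with "#!/" has the leading bytes 23 21 2F, so (determine-encoding/prefix '(#x23 #x21 #x2F)) falls through to the else clause and yields UTF-8.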
Marc