[r6rs-discuss] perhaps i should be formal, but....

From: MichaelL_at_frogware.com <MichaelL>
Date: Wed Mar 14 23:50:19 2007

> > Bytevectors are definitely a very useful low-level addition to Scheme. But
> > single/multi-byte strings were, I think, an unnecessary loss, especially
> > for those who do lots of operating system- and library-level work.
>
> You seem to be lamenting the loss of something that never
> was.

Am I? Perhaps I'm focusing too narrowly on the Scheme implementations I've
used. Bigloo has separate Unicode/non-Unicode (single/multi-byte)
character & string types; Chicken & Chez both have non-Unicode
(single/multi-byte) character & string types. All of them have good
foreign interface support. And single/multi-byte strings still dominate
most foreign libraries. To get a quick sense, search for "char *" vs
"wchar_t *" at
http://www.gnu.org/software/libc/manual/html_mono/libc.html. Even good old
fopen requires a single/multi-byte string!

> > In fact, my position
> > would be even more extreme: I lament the loss of single/multi byte strings
> > in general (which would include UTF-8). They're still useful for low-level
> > work. In fact, they'll still be needed--think of the various Scheme to C
> > compilers, for example, that will need a char equivalent--they just won't
> > be standardized anymore.
>
> I don't know exactly what you mean by single/multi byte
> strings, but you indicated that they include UTF-8.
>
> I am not aware of anything in R5RS that would correspond
> to any definition of single/multi byte strings that would
> include UTF-8.

By "single/multi-byte string" I mean the equivalent of C's "char *" type
(as opposed to "wchar_t *"). On Linux, Mac, and Solaris, UTF-8 is a
supported multi-byte encoding; indeed, it is becoming the preferred
encoding. (See, for example, "What programming languages support Unicode?"
and following at http://www.cl.cam.ac.uk/~mgk25/unicode.html.) Mac's
wchar_t is UTF-16, and Linux and Solaris are UCS-4, but in fact many libc
str/mb functions work correctly on UTF-8 strings without conversion to
wchar_t. (I may be wrong about Solaris' wchar_t type; it's been a while
since I looked.)
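
As a rough illustration of the point about str functions working on UTF-8
without conversion to wchar_t, here is a minimal C sketch. The helper names
utf8_code_points and utf8_find_ascii are invented for this example and are not
part of libc; the sketch also assumes its input is valid UTF-8. It relies on
one property of the encoding: every byte of a multi-byte sequence has its high
bit set, so byte-oriented searches for ASCII characters cannot match in the
middle of a character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a UTF-8 string by skipping
   continuation bytes (those of the form 10xxxxxx).
   Assumes the input is valid, NUL-terminated UTF-8. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Hypothetical helper: find an ASCII delimiter with plain strchr.
   Safe on UTF-8 because no byte of a multi-byte sequence is in the
   ASCII range, so a false match inside a character is impossible. */
const char *utf8_find_ascii(const char *s, char delim)
{
    assert(s != NULL);
    return strchr(s, delim);
}
```

Note that strlen on the same string reports bytes, not characters, so the two
counts differ as soon as the text contains anything outside ASCII. Functions
that depend on character boundaries or case mapping still need more care.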

> So what do you mean by saying they "won't
> be standardized anymore"?

Well, as I said I was probably thinking only about the Schemes I've used
over the last couple of years, and in all of them "string" was the
equivalent of "single/multi-byte string." Bad assumption. Gambit, as far
as I remember, is UCS-2, and PLT went Unicode a while ago, though I don't
remember which encoding they use.

But any Scheme with a good foreign interface will have to deal with char
strings. I presume there will be two choices: either convert strings
automatically, or provide two different character & string types. If you
care about control & performance, the first option isn't good, so you'd
want the second. And if you go the second route each Scheme will come up
with its own set of names and operations for single/multi-byte strings.
So: right now within the Schemes I've used there's agreement on
single/multi-byte strings but none on Unicode, and I'm going to end up
trading that for agreement on Unicode and none on single/multi-byte
strings.
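
To make the cost of the first option concrete, the following C sketch shows
roughly what an automatic conversion at the foreign-call boundary involves
when the Scheme side stores UTF-16 and the C library expects a char string.
The function name utf16_to_utf8 is hypothetical, not any implementation's or
library's actual API, and the UTF-16 internal representation is an assumption
made for the example. The point is that each crossing pays for an allocation
and a full pass over the string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical boundary helper: convert `len` UTF-16 code units into a
   freshly allocated, NUL-terminated UTF-8 string. Returns NULL on
   allocation failure or an unpaired surrogate. Caller frees the result. */
char *utf16_to_utf8(const uint16_t *src, size_t len)
{
    assert(src != NULL || len == 0);
    /* Worst case is 3 bytes per code unit; a surrogate pair is
       2 units producing 4 bytes, which fits within that bound. */
    unsigned char *out = malloc(len * 3 + 1);
    if (out == NULL)
        return NULL;

    size_t o = 0;
    for (size_t i = 0; i < len; ++i) {
        uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i + 1 >= len || src[i + 1] < 0xDC00 || src[i + 1] > 0xDFFF) {
                free(out);
                return NULL;
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate with no preceding high surrogate. */
            free(out);
            return NULL;
        }

        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return (char *)out;
}
```

With a separate single/multi-byte string type, a call such as fopen could be
handed the existing bytes directly, and none of this work would be needed.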

Btw, this is of practical interest to me. These days I make a living
writing cross-platform web server software in Chez. After a long look at
all of these issues we've decided to add Unicode support to Chez via IBM's
ICU library (http://www-306.ibm.com/software/globalization/icu/index.jsp).
ICU is UTF-16, but we'll read and write UTF-8 and we'll keep our UTF-8
strings in Chez' current string type.
Received on Wed Mar 14 2007 - 23:49:47 UTC