[r6rs-discuss] Strings as codepoint-vectors: bad

From: Per Bothner <per>
Date: Thu Mar 15 12:39:41 2007

Jason Orendorff wrote:
> Making strings vectors of 16-bit values is simple, familiar,
> speed-efficient, memory-efficient, easy to implement, and convenient
> for programmers.

Making strings vectors of 8-bit UTF-8 code units works similarly.
You're going to have the same semantic issues.

UTF-8 is more compatible with files on Unix/Linux, and
is the default encoding for XML and the Internet.

It's basically an engineering tradeoff whether to use
UTF-8 or UTF-16 code units. For most East and South Asian scripts
the latter will be more space-efficient; for Arabic, Hebrew, Greek
and Cyrillic the two are roughly equal; for mostly-ASCII text
UTF-8 will be more space-efficient.
But these days (with large video files clogging up disks
and networks) the difference is trivial. Processing (time)
efficiency will typically track whichever encoding is more
space-efficient for the data at hand.
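
To make the space tradeoff concrete, one can encode a few sample strings both ways and compare lengths. This is a minimal sketch in Python rather than Scheme; the helper name encoded_sizes and the sample strings are made up for illustration, and real text (with markup, digits, and spaces) will shift the numbers.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text.

    "utf-16-le" is used so that no byte-order mark is counted.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Illustrative samples only; not taken from the original message.
samples = {
    "English": "hello world",
    "Arabic": "مرحبا بالعالم",
    "Chinese": "你好世界",
}

for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

With these samples the ASCII string doubles in size under UTF-16, the Chinese string is a third smaller under UTF-16, and the Arabic string comes out about the same either way.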

Most code will, as you say, work fine even if string-ref
works on raw 8/16-bit code units. But those code
units will not be "characters". We'd have to remove
the "character" functions.

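The point about string-ref can be seen by indexing code units directly. This is a minimal sketch in Python as a stand-in for Scheme; utf8_code_units and utf16_code_units are hypothetical helpers, not part of any proposed R6RS API.

```python
import struct


def utf8_code_units(text):
    """Return the UTF-8 encoding of text as a list of 8-bit code units."""
    return list(text.encode("utf-8"))


def utf16_code_units(text):
    """Return the UTF-16 encoding of text as a list of 16-bit code units."""
    raw = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


# 7 code points: U+00EF is i-with-diaeresis, U+1D11E is the musical G clef.
text = "na\u00efve \U0001D11E"

u8 = utf8_code_units(text)
u16 = utf16_code_units(text)

# Three different lengths for the same string.
print(len(text), len(u8), len(u16))

# Indexing by code unit can land inside a multi-unit sequence.
print(hex(u8[2]))   # lead byte of a two-byte UTF-8 sequence, not a character
print(hex(u16[6]))  # high surrogate of a UTF-16 pair, not a character
```

Either representation gives constant-time indexing, but the thing returned at a given index is a fragment of an encoding, which is why a code-unit string-ref could not honestly be said to return a character.
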
-- 
	--Per Bothner
per_at_bothner.com   http://per.bothner.com/

Received on Thu Mar 15 2007 - 12:38:43 UTC
