[r6rs-discuss] design thumb-axioms for strings

From: Thomas Lord <lord>
Date: Wed, 29 Aug 2007 14:11:22 -0700

Some rule-of-thumb "axioms" for implementing strings:

For I/O and good interaction with VM, string data in memory
needs a high locality (bursts of characters contiguous in memory)
but: the bursts don't have to be larger than a VM page to obtain
the full benefits.

Substring extraction and concatenation should be O(1) operations.
Then it would follow that mutation and reference are O(1). You
can't do that perfectly, though.

In a recursive algorithm, optimizing the leaf steps (base case steps)
is critical. The contiguous bursts of characters should all use
fixed-width encodings.

A key trick is that not all bursts of characters have to use the
same fixed-width encoding. The combinatorics are a little
hairy but with only three fixed-width encodings (for 8-bit,
16-bit, and 32-bit code values) you can satisfy all of the above,
swimmingly. (Notable objection: Stallman tells me thinks
the combinatorics here are *too* hairy to be worth the effort.)

The combinatorics get a little harrier but with just one additional
fixed-width encoding you can handle a character set of unbounded
size (such as Unicode when taking composite characters as your
primitive notions of characters).

Dillenger keeps hinting that that fourth encoding might be all
anyone ever needs. I don't buy it but I can't prove I'm right.

Boehm is probably the author of the cites everyone should be
including. He gave us enough rope to hang ourselves.

-t
Received on Wed Aug 29 2007 - 17:11:22 UTC

This archive was generated by hypermail 2.3.0 : Wed Oct 23 2024 - 09:15:01 UTC