Thomas Lord wrote:
> By which I take you to mean I should show you uses for random
> access into strings? You are talking about using
> markers instead of integer string indexes, right?
Right.
> Fast pattern matching. Fast search of texts displaying
> various kinds of sorting. Fast fingerprinting. Quite
> a few more, I'm sure, even though those already broad
> categories.
True, though those are pretty esoteric, written as
specialized low-level libraries.
More importantly note that in most cases you don't need to
move by N "characters" - you need to move by N *code
units*, which would be an O(1) operation. For fast
string matching/searching, as long as the subject
string and the pattern use the same encoding, you can
treat multi-code-unit characters equivalent to a sequence
of code units: Searching a sequence of characters is
equivalent to searching for a sequence of code units.
> With that set of assumptions (high locality of mutations and
> a 25% waste of space) you've leant support to your formal
> proposal for STRING-APPEND! but not any for markers.
> On the contrary, you should be happy with a STRING-REPLACE!
> that replaces an arbitrary substring, not necessarily preserving
> length. Your implementation is free to use gap buffers
> internally and, if your use assumptions hold for a given
> application, it will perform swell -- no markers needed.
The problem is with the API for specifying the substring:
Assume the buffer contains UTF-8 or UTF-16 code units.
If the position is specified as a character (scalar value)
index then we need to scan (or use some kind of index) to
map that index into a buffer offset. If the position is
specified as code unit offset, or a unspecified "magic
cookie" then we can index directly into the buffer.
An advantage of using opaque "marker" objects rather than
indexes is that it's a way to leave the representation
up to the implementation, so it can use whatever is most
efficient.
--
--Per Bothner
per_at_bothner.com http://per.bothner.com/
Received on Sun Mar 25 2007 - 16:58:45 UTC