[r6rs-discuss] Strings

From: Thomas Lord <lord>
Date: Sun Mar 25 18:21:20 2007

Per Bothner wrote:
> Thomas Lord wrote:
>
>> Fast pattern matching. Fast search of texts displaying
>> various kinds of sorting. Fast fingerprinting. Quite
>> a few more, I'm sure, even though those already broad
>> categories.
>
> True, though those are pretty esoteric, written as
> specialized low-level libraries.
>
> More importantly note that in most cases you don't need to
> move by N "characters" - you need to move by N *code
> units*, which would be an O(1) operation. For fast
> string matching/searching, as long as the subject
> string and the pattern use the same encoding, you can
> treat multi-code-unit characters equivalent to a sequence
> of code units: Searching a sequence of characters is
> equivalent to searching for a sequence of code units.

I'm not sure what you are trying to argue or if you just
mean to pick my brain about implementing strings, buffers,
markers, properties, overlays etc.

It is true that if a particular UTF-* internal representation
is presumed then one way to get good performance is to
translate algorithms over code points into algorithms over
encoding units. So?




>
>> With that set of assumptions (high locality of mutations and
>> a 25% waste of space) you've leant support to your formal
>> proposal for STRING-APPEND! but not any for markers.
>> On the contrary, you should be happy with a STRING-REPLACE!
>> that replaces an arbitrary substring, not necessarily preserving
>> length. Your implementation is free to use gap buffers
>> internally and, if your use assumptions hold for a given
>> application, it will perform swell -- no markers needed.
>
> The problem is with the API for specifying the substring:
> Assume the buffer contains UTF-8 or UTF-16 code units.
> If the position is specified as a character (scalar value)
> index then we need to scan (or use some kind of index) to
> map that index into a buffer offset.

In the assumptions you asked me to consider, performance
only matters for appending to a string and we can assume
that in most cases space has already been reserved for this.
Now you are revoking the assumptions you asked me to
reason under. We are going in circles.


-t



> If the position is
> specified as code unit offset, or a unspecified "magic
> cookie" then we can index directly into the buffer.
>
> An advantage of using opaque "marker" objects rather than
> indexes is that it's a way to leave the representation
> up to the implementation, so it can use whatever is most
> efficient.
Received on Sun Mar 25 2007 - 17:31:18 UTC

This archive was generated by hypermail 2.3.0 : Wed Oct 23 2024 - 09:15:01 UTC