[r6rs-discuss] [Formal] String positions and string slices

From: John Cowan <cowan>
Date: Fri, 6 Apr 2007 22:16:09 -0400

---
This message is a formal comment which was submitted to formal-comment at r6rs.org, following the requirements described at: http://www.r6rs.org/process.html
---
Submitter: John Cowan
Email address: cowan at ccil.org
Issue type: Enhancement
Priority: Major
Component: Base Library
Report version: 5.92
Summary:  Proposes string positions and string slices to solve problems
          with O(n) Unicode string operations.
This proposal is designed to provide a set of procedures for R6RS
strings that will preserve the historic linear behavior of most R5RS
implementations, while still providing the Unicode scalar value
semantics of R6RS characters.  These procedures are specifically
designed to be equally applicable to representations using fixed-width
arrays, variable-width arrays, and trees of various types.
Internally, characters in strings may not be laid out in uniform-width
cells in memory.  Consequently, random access to characters by the natural
number of their position in the sequence may not run in constant time.
Positions in strings are therefore identified by opaque string-positions.
Following the convention of string editors, string-positions identify not
characters in the string but the positions between them.  String-positions
are *not* necessarily disjoint from existing Scheme data types; they may
be identical to exact non-negative integers, for example.
 
Each string-position is associated with a set of strings into which
it is valid.  Strings may be "sliced up" and seen in parts with the
string-slice procedure or its derivatives; string-positions that were
valid in a string are also valid at corresponding positions in slices
of the string, because slices of the string still refer to the same
underlying string structure.  Because operating on a portion of string is
such a common operation, the running time of string-slice is guaranteed
to be sublinear in the number of characters within that slice of string;
programmers are encouraged to use string-slice frequently.
IMPORTANT NOTE:  Any use of string-set! on a string invalidates all
string positions associated with those string; it is an error to use
any of them after that.
 
There is no method by which to find the string associated with a position,
primarily because there is no such unique string, but rather a set
of strings, some of which are slices of others; and in part to permit
efficient implementations of string in which positions are represented
as offsets into the string storage that fit in machine registers and
require no storage on the heap.  Contrariwise, there is no guarantee that
string positions are exact non-negative integers, so implementations
less focussed on run-time performance may provide enough information to
verify the correct use of positions.
The following new procedures are proposed for the base library:
  (string-start-position <string>) -> position
    Returns the first position in <string>, before which there are no
    characters.
  (string-end-position <string>) -> position
    Returns the last position in <string>, after which there are no
    characters.
  (position-in-string? <position> <string>) -> boolean
    Returns true if <position> is a valid position in <string>, and false
    if not.  <Position> must be a string-position in either case.
  (string-position=? <position-a> <position-b> <string>) -> boolean
    Returns true if <position-a> and <position-b> identify the same
    position in <string>.  Returns false if they identify different
    positions.  It is an error if either of <position-a> or <position-b>
    is not a valid position in <string>.
  (string-position<?  <position-a> <position-b> <string>) -> boolean
  (string-position>=? <position-a> <position-b> <string>) -> boolean
  (string-position>?  <position-a> <position-b> <string>) -> boolean
  (string-position<=? <position-a> <position-b> <string>) -> boolean
    Procedures imposing a total ordering on string-positions.  It is an
    error if <position-a> and <position-b> are not valid positions into
    <string>.
  (string-forward <position> <string> [<count>]) -> position
    Returns a string-position for the position in <string> that is <count>
    characters after <position>, or false if there are fewer than
    <count> characters after <position> in <string>.  If <count> is not
    supplied, its default value is 1.  It is an error if <position> is
    not a valid position in <string>.
  (string-backward <position> <string> [<count>]) -> position
    Returns a string-position for the position in <string> that is <count>
    characters before <position>, or false if there are fewer than
    <count> characters before <position> in <string>.  If <count> is not
    supplied, its default value is 1.  It is an error if <position> is
    not a valid position in <string>.
  (string-slice <string> <start-position> <end-position>) -> string
    Returns a string that contains the sequence of characters in <string>
    between <start-position> and <end-position>.  This slice of the string
    may refer to storage shared by <string>.  The running time of
    string-slice on average must be sublinear in the number of characters
    between <start-position> and <end-position>.
    For example, string-slice may run in (amortized) constant time, if a
    string is a triple of internal storage, a start bound, and an end
    bound, and string-slice need only construct a new triple with
    tighter bounds; or string-slice may run in logarithmic time, if a
    string is structured as a tree of content.  string-slice may *not*
    simply copy all of the characters in <string>; programmers may rely
    on its performance, and should not be afraid to use it frequently.
  (string-prefix <string> <end-position>) -> string
  (string-suffix <string> <start-position>) -> string
    string-prefix returns a slice of <string> that contains the sequence
    of characters before <end-position>.  string-suffix returns a slice
    of <string> that contains the sequence of characters after
    <start-position>.
      (define (string-prefix string end-position)
        (string-slice string (string-start-position string) end-position))
      (define (string-suffix string start-position)
        (string-slice string start-position (string-end-position string)))
  Because a string comprises a sequence of characters, each character
  in the sequence may be assigned an index (an exact and non-negative number)
  as used in the string-ref and string-set! procedures.  There is an
  isomorphism between these indices and string-positions; the following
  procedures deal with this isomorphism.
  (string-position->index <position> <string>) -> index
    Returns the number of characters in <string> that precede <position>,
    or the number of iterated applications of string-forward to the
    starting position in <string> necessary to find a position equal to
    <position>.  This operation may run in time proportional to the
    value of the index.  It is an error if <position> is not a valid
    position in <string>.
      (define (string-position->index position string)
        (do ((index 0 (+ index 1))
             (position* (string-start-position string)
                      (string-forward position* string)))
            ((string-position=? position* position)
             index)))
  (index->string-position <index> <string>) -> position
    Returns a string-position after <index> characters in <string>, or
    iteratively applies string-forward <index> times to the starting
    position in <string>.  This operation may run in time proportional to
    the value of the index.  It is an error if <index> exceeds the
    number of characters in <string>.
      (define (index->string-position index string)
        (do ((index index (- index 1))
             (position (string-start-position string)
                     (string-forward position string)))
            ((zero? index)
             position)))
  (string-position-difference <position-a> <position-b> <string>)
      -> integer difference
    Returns the number of characters after <position-b> and before
    <position-a>.  If <position-a> precedes <position-b> in <string>, this is
    the additive inverse of the number of characters after <position-a>
    and before <position-b>, i.e. (- (string-position-difference <position-b>
    <position-a> <string>)).
    This is provided separately so that implementations may provide a
    more efficient implementation than counting up with string-forward.
The following procedures provide a portable interface to searching and
functional editing.  They are proposed as the (r6rs string) library, but
could be moved to a SRFI instead.
  (string-search-forward <string> <pattern>)
      -> [match-start-position match-end-position] or [#F #F]
  (string-search-backward <string> <pattern>)
      -> [match-start-position match-end-position] or [#F #F]
    string-search-forward searches forward through <string> for the first
    occurrence of the pattern; string-search-backward searches backward
    through <string> for the last occurrence of the pattern.  If a match
    is found, returns two positions identifying the starting and ending
    positions of the match in <string>; if no match is found, returns
    (values #f #f).  If <pattern> is not a string, the behavior is
    implementation-dependent.
  (string-search-forward/ci <string> <pattern>)
      -> [match-start-position match-end-position] or [#F #F]
  (string-search-backward/ci <string> <pattern>)
      -> [match-start-position match-end-position] or [#F #F]
    Case-insensitive variants of the preceding two procedures.
  (string-append <string> ...) -> string
  (string-concatenate <list-of-strings>) -> string
  (string-join <list-of-strings> <prefix> <infix> <suffix> [<empty>])
      -> string
    These return concatenations of string.  string-append returns the
    concatenation of each <string> ....  string-concatenate returns the
    concatenation of each element of <list-of-strings>.  string-join
    returns <empty>, or an empty string, if all elements of
    <list-of-strings> are empty; or a string composed by concatenating
    <prefix>, each of the elements of <list-of-strings> with <infix>
    between each pair of consecutive elements, and <suffix>.
  (string-replace <string> <start-position> <end-position> <replacement-string>)
      -> string
    Returns a string composed of the sequence of characters in <string>
    before <start-position>, followed by the sequence of characters in
    <replacement-string>, followed by the sequence of characters in
    <string> after <end-position>.  It is an error if either of
    <start-position> and <end-position> is not a valid position in <string>.
  (string-insert <string> <position> <insertion-string>) -> string
    Returns a string composed of the sequence of characters in <string>
    before <position>, followed by the sequence of characters in
    <insertion-string>, followed by the sequence of characters in
    <string> after <position>.  It is an error if <position> is not a valid
    position in <string>.
  (string-delete <string> <start-position> <end-position>) ->
      [string deleted-string position]
    Returns three values: a string composed by deleting the sequence of
    characters in <string> between <start-position> and <end-position>, as
    if with string-delete; the string that was deleted; and a position
    identifying the position in the first string where the second string
    was deleted.
-- 
John Cowan    http://ccil.org/~cowan    cowan at ccil.org
[T]here is a Darwinian explanation for the refusal to accept Darwin.
Given the very pessimistic conclusions about moral purpose to which his
theory drives us, and given the importance of a sense of moral purpose
in helping us cope with life, a refusal to believe Darwin's theory may
have important survival value. --Ian Johnston
Received on Fri Apr 06 2007 - 22:16:09 UTC

This archive was generated by hypermail 2.3.0 : Wed Oct 23 2024 - 09:15:01 UTC