[r6rs-discuss] Re: [Formal] formal comment (ports, characters, strings, Unicode)

From: Abdulaziz Ghuloum <aghuloum>
Date: Mon Mar 19 21:13:51 2007

On Mar 19, 2007, at 8:17 PM, MichaelL_at_frogware.com wrote:

> UTF-8 and UTF-16 require one or more code units to represent a given
> scalar value. Since the number of code units depends on the scalar value
> being encoded there's no algorithm that maps the i'th scalar value to the
> j'th code unit. If you want the i'th scalar value in a UTF-8 or UTF-16
> string you have to search for it. And that, of course, is what string-ref
> is, a request for the i'th scalar value (returned as a character).
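
To make the quoted point concrete, here is a minimal sketch (in Python, not
Scheme, and not taken from any R6RS implementation) of what a string-ref
over a UTF-8 buffer has to do. The name utf8_scalar_at is hypothetical, and
the sketch assumes the input is well-formed UTF-8.

```python
def utf8_scalar_at(data: bytes, i: int) -> str:
    """Return the i'th scalar value of UTF-8 encoded data via a linear scan."""
    if i < 0:
        raise IndexError(i)
    pos = 0
    remaining = i
    while pos < len(data):
        lead = data[pos]
        # The lead byte alone determines how many code units the sequence uses.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid lead byte at offset %d" % pos)
        if remaining == 0:
            return data[pos:pos + width].decode("utf-8")
        remaining -= 1
        pos += width  # skip the whole sequence; no way to jump straight to i
    raise IndexError(i)


sample = "a\u00e9\u20ac\U0001D11Ez"      # 1-, 2-, 3-, 4- and 1-byte sequences
encoded = sample.encode("utf-8")         # 11 code units for 5 scalar values
third = utf8_scalar_at(encoded, 3)       # has to walk past the first three
```

The cost is proportional to i, which is the "search" the quoted text refers
to; an index of byte offsets could reduce it, at the price of extra space.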

From what I understand, UTF-8, UTF-16, and UTF-32 are interchange formats.
Unicode text encoded in any one of the formats can be converted to another
without loss of information (right?). Moreover, the internal representation
of strings does not have to match the external representation. For example,
you can read a UTF-32 encoded file into a variable-length buffer to save
some space (sometimes); or alternatively, you can read a UTF-8 encoded file
into a fixed-length buffer to save time on random-access (sometimes).
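
As an illustration only (Python rather than Scheme; the helper name
transcode and the sample text are made up for this sketch), the round trip
below shows that well-formed text survives conversion between the three
encoding forms, and that the in-memory form can differ from the external
one. The guarantee assumes well-formed input; lone surrogates, for example,
are not valid in any of the three forms.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode data from one Unicode encoding form to another."""
    return data.decode(src).encode(dst)


text = "na\u00efve \u03bb \U0001D11E"    # 9 scalar values, one outside the BMP
utf8 = text.encode("utf-8")
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-le")
round_trip = transcode(utf32, "utf-32-le", "utf-8")

# External form: variable-length UTF-8 bytes.
# Internal form: one fixed-width slot per scalar value, so indexing is direct.
scalars = [ord(c) for c in utf8.decode("utf-8")]
seventh = chr(scalars[6])                # constant-time lookup, no scan

sizes = (len(utf8), len(utf16), len(utf32), len(scalars))
# sizes == (14, 20, 36, 9): bytes per encoding form, then scalar count
```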

Is the following a valid summary of the issue?

   The existence of string-ref and string-set! operations seems to imply
   that a variable-length internal representation is not an option and
   a fixed-length representation wastes space and is therefore inefficient
   (mostly in an ASCII-centered world).
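
A rough sketch of why string-set! is awkward on a variable-length buffer
(Python again, with hypothetical helper names utf8_offset and
utf8_string_set; it assumes well-formed UTF-8 and a single-character
replacement): the new character may need a different number of code units
than the old one, so the tail of the buffer has to move.

```python
def utf8_offset(buf: bytearray, i: int) -> int:
    """Byte offset of the i'th scalar value, found by a linear scan."""
    count = -1
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:          # not a continuation byte: a new scalar starts
            count += 1
            if count == i:
                return pos
    raise IndexError(i)


def utf8_string_set(buf: bytearray, i: int, ch: str) -> None:
    """Replace the i'th scalar value in place; the buffer may grow or shrink."""
    start = utf8_offset(buf, i)
    end = start + 1
    while end < len(buf) and buf[end] & 0xC0 == 0x80:
        end += 1                         # extend over the old character's continuation bytes
    buf[start:end] = ch.encode("utf-8")  # slice assignment shifts everything after it


buf = bytearray(b"abc")                  # 3 bytes, 3 scalar values
utf8_string_set(buf, 1, "\u03bb")        # 1-byte 'b' becomes 2-byte lambda
grown = len(buf)                         # now 4 bytes for the same 3 scalar values
```

With a fixed-width representation the same update is a single slot write,
which is the time/space trade-off the summary above describes.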

Aziz,,,
Received on Mon Mar 19 2007 - 21:13:35 UTC
