[r6rs-discuss] Strings

From: Aubrey Jaffer <agj>
Date: Sun Mar 25 12:58:51 2007

 | From: "Marcin 'Qrczak' Kowalczyk" <qrczak_at_knm.org.pl>
 | Date: Sun, 25 Mar 2007 12:46:49 +0200
 |
 | On 24-03-2007 (Sat) at 13:31 -0400, MichaelL_at_frogware.com
 | wrote:
 |
 | > Summary
 | > "This document attempts to make the case that it is advantageous to use
 | > UTF-16 (or 16-bit Unicode strings) for text processing..."
 |
 | IMHO this is one of the worst mistakes Unicode is trying to make.
 | It convinces people that they should not worry about characters above
 | U+FFFF just because they are very rare. UTF-16 combines the worst
 | aspects of UTF-8 and UTF-32.
 |
 | If size is important and variable width of the representation of a code
 | point is acceptable, then UTF-8 is usually a better choice. If O(1)
 | indexing by code points is important, then UTF-32 is better. Nobody
 | wants to process texts in terms of UTF-16 code units. Nobody wants to
 | have surrogate processing sprinkled around the code, and thus if one
 | accepts an API which extracts variable width characters, then the API
 | could as well deal with UTF-8, which is better for interoperability.
 | UTF-16 makes no sense.

I agree.

There also seems to be a hidden assumption in some posts that
character alignment can only be recovered if a string is scanned from
the beginning. This is not the case.

Character alignment can be recovered from any octet within a UTF-8
encoded string. The octet which begins a code point can never be
mistaken for one of its continuation octets, because continuation
octets always have #b10 as their most significant two bits, a pattern
no leading octet shares.
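That recovery can be sketched in a few lines. This is an illustrative
Python fragment, not anything from the Scheme discussion; the function
name is invented, and it assumes the input is well-formed UTF-8 (so at
most three continuation octets precede a leading octet):

```python
def align_to_code_point(data: bytes, i: int) -> int:
    """Back up from an arbitrary octet offset to the start of the
    code point containing it.  Continuation octets all match the
    bit pattern #b10xxxxxx, so at most 3 backward steps are needed."""
    while i > 0 and (data[i] & 0xC0) == 0x80:  # 0b10xxxxxx?
        i -= 1
    return i

# "naïve" encodes as b'na\xc3\xafve'; offset 3 lands on the
# continuation octet 0xAF and is realigned to the leading octet at 2.
print(align_to_code_point("naïve".encode("utf-8"), 3))  # -> 2
```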

There are algorithms (like binary search) which access a string at
approximate locations. The asymptotic running time of such algorithms
is not impacted by coding strings in UTF-8: realigning each probe to
the nearest code-point boundary inspects at most three extra octets.
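As a concrete sketch of that claim (again in Python, with an invented
function name, and assuming a well-formed UTF-8 string whose code
points happen to be in sorted order), a binary search can probe the
byte array directly and realign each midpoint before comparing:

```python
def utf8_bisect(data: bytes, target: str) -> int:
    """Binary search a UTF-8 byte string whose code points are sorted,
    returning the byte offset of the first code point >= target.
    Each probe is realigned to a code-point boundary, costing at most
    3 extra octet inspections, so the search stays O(log n)."""
    lo, hi = 0, len(data)
    while lo < hi:
        mid = (lo + hi) // 2
        # Realign mid to the start of a code point (skip #b10xxxxxx).
        while mid > lo and (data[mid] & 0xC0) == 0x80:
            mid -= 1
        # Length of the code point is determined by its leading octet.
        b = data[mid]
        n = 4 if b >= 0xF0 else 3 if b >= 0xE0 else 2 if b >= 0xC0 else 1
        cp = data[mid:mid + n].decode("utf-8")
        if cp < target:
            lo = mid + n
        else:
            hi = mid
    return lo

# "abcé" encodes as b'abc\xc3\xa9'; the probe for "é" realigns from
# the continuation octet at offset 4 to the leading octet at 3.
print(utf8_bisect("abcé".encode("utf-8"), "é"))  # -> 3
```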
Received on Sun Mar 25 2007 - 12:48:01 UTC
