[r6rs-discuss] Strings

From: Aubrey Jaffer <agj>
Date: Sun Mar 25 12:58:51 2007

 | From: "Marcin 'Qrczak' Kowalczyk" <qrczak_at_knm.org.pl>
 | Date: Sun, 25 Mar 2007 12:46:49 +0200
 |
 | On 24-03-2007 (Sat) at 13:31 -0400, MichaelL_at_frogware.com
 | wrote:
 |
 | > Summary
 | > "This document attempts to make the case that it is advantageous to use
 | > UTF-16 (or 16-bit Unicode strings) for text processing..."
 |
 | IMHO this is one of the worst mistakes Unicode is trying to make.
 | It convinces people that they should not worry about characters above
 | U+FFFF just because they are very rare. UTF-16 combines the worst
 | aspects of UTF-8 and UTF-32.
 |
 | If size is important and variable width of the representation of a code
 | point is acceptable, then UTF-8 is usually a better choice. If O(1)
 | indexing by code points is important, then UTF-32 is better. Nobody
 | wants to process texts in terms of UTF-16 code units. Nobody wants to
 | have surrogate processing sprinkled around the code, and thus if one
 | accepts an API which extracts variable width characters, then the API
 | could as well deal with UTF-8, which is better for interoperability.
 | UTF-16 makes no sense.

I agree.

There also seems to be a hidden assumption in some posts that
character alignment can only be recovered if a string is scanned from
the beginning. This is not the case.

Character alignment can be recovered from any octet within a UTF-8
encoded string. The octet which begins a code point can never be
mistaken for one of its continuation octets, because continuation
octets always have #b10 as their most significant two bits, a pattern
no leading octet shares.
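That recovery can be sketched in a few lines. This is an illustrative
Python fragment, not anything from the Scheme discussion; the function
name is invented, and it assumes the input is well-formed UTF-8 (so at
most three continuation octets precede a leading octet):

```python
def align_to_code_point(data: bytes, i: int) -> int:
    """Back up from an arbitrary octet offset to the start of the
    code point containing it.  Continuation octets all match the
    bit pattern #b10xxxxxx, so at most 3 backward steps are needed."""
    while i > 0 and (data[i] & 0xC0) == 0x80:  # 0b10xxxxxx?
        i -= 1
    return i

# "naïve" encodes as b'na\xc3\xafve'; offset 3 lands on the
# continuation octet 0xAF and is realigned to the leading octet at 2.
print(align_to_code_point("naïve".encode("utf-8"), 3))  # -> 2
```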

There are algorithms (like binary search) which access a string at
approximate locations. The asymptotic running time of such algorithms
is not impacted by coding strings in UTF-8: realigning each probe to
the nearest code-point boundary inspects at most three extra octets.
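As a concrete sketch of that claim (again in Python, with an invented
function name, and assuming a well-formed UTF-8 string whose code
points happen to be in sorted order), a binary search can probe the
byte array directly and realign each midpoint before comparing:

```python
def utf8_bisect(data: bytes, target: str) -> int:
    """Binary search a UTF-8 byte string whose code points are sorted,
    returning the byte offset of the first code point >= target.
    Each probe is realigned to a code-point boundary, costing at most
    3 extra octet inspections, so the search stays O(log n)."""
    lo, hi = 0, len(data)
    while lo < hi:
        mid = (lo + hi) // 2
        # Realign mid to the start of a code point (skip #b10xxxxxx).
        while mid > lo and (data[mid] & 0xC0) == 0x80:
            mid -= 1
        # Length of the code point is determined by its leading octet.
        b = data[mid]
        n = 4 if b >= 0xF0 else 3 if b >= 0xE0 else 2 if b >= 0xC0 else 1
        cp = data[mid:mid + n].decode("utf-8")
        if cp < target:
            lo = mid + n
        else:
            hi = mid
    return lo

# "abcé" encodes as b'abc\xc3\xa9'; the probe for "é" realigns from
# the continuation octet at offset 4 to the leading octet at 3.
print(utf8_bisect("abcé".encode("utf-8"), "é"))  # -> 3
```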
Received on Sun Mar 25 2007 - 12:48:01 UTC
