[r6rs-discuss] Why lexers can be simpler when restricted to ASCII

From: Lars T Hansen <lth>
Date: Mon, 23 Apr 2007 21:07:22 +0200

On 4/23/07, Alan Watson <alan at alan-watson.org> wrote:
> In formal comment 231, I stated:
>
> "Many current Schemes have lexers written for ASCII (or Latin-1)
> character sets. Conversion of these lexers to the new standard would be
> easier if the report allowed inline hex escapes to appear anywhere in
> Scheme code."
>
> The editors replied:
>
> "It is unclear why converting the lexers would be significantly simpler
> through this change"
>
> Let me explain my original opinion. Many Schemes currently have lexers
> written in C using "char". These need converting to "long" to handle
> Unicode. Furthermore, table-driven approaches are practical for ASCII
> (128 values), but not practical for Unicode (roughly 2^21 values).
>
> In case that isn't clear enough: My Scheme uses flex for its lexer. I
> cannot see how to simply convert it to accept Unicode. I think I will
> have to dump flex and implement a new lexer by hand.

Normally you can make Flex work on Unicode by converting the input to
UTF-8 before lexing it, having first rewritten the flex input to work
on UTF-8. It's not exactly pretty, but (speaking from experience) if
you don't mind accepting a superset of the valid characters for
identifiers it's not bad at all. State-dependent recognizers in the
flex input are very helpful here.
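
A minimal sketch of the superset idea, written as plain C rather than
flex input so it can be compiled on its own. It assumes the input is
already UTF-8 and treats every byte >= 0x80 as an identifier
constituent, which in a flex rule would correspond roughly to adding a
class such as [\x80-\xff] to the identifier pattern. The function names
and the simplified ASCII identifier syntax are illustrative
assumptions, not taken from any particular implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: may byte b start an identifier?
 * Bytes >= 0x80 occur only inside multi-byte UTF-8 sequences, so
 * accepting all of them admits every non-ASCII character, which is a
 * superset of the characters the report actually allows. */
static int is_initial(unsigned char b)
{
    if (b >= 0x80)
        return 1;
    if ((b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z'))
        return 1;
    /* Simplified set of ASCII "special initial" characters. */
    return b != '\0' && strchr("!$%&*/:<=>?^_~", b) != NULL;
}

/* Hypothetical helper: may byte b continue an identifier? */
static int is_subsequent(unsigned char b)
{
    if (is_initial(b))
        return 1;
    if (b >= '0' && b <= '9')
        return 1;
    return b != '\0' && strchr("+-.@", b) != NULL;
}

/* Return the length in bytes of the identifier at the start of the
 * NUL-terminated UTF-8 string s, or 0 if none starts there. The scan
 * is byte by byte, so the lexer never needs more than a 256-entry
 * view of its input. */
size_t scan_identifier(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 1;

    assert(s != NULL);
    if (!is_initial(p[0]))
        return 0;
    while (is_subsequent(p[n]))
        n++;
    return n;
}
```

Because the scan works on bytes, the "char"-based structure of an
existing lexer is preserved. The cost is that sequences such as a
UTF-8 encoded no-break space are accepted as identifiers; a later pass
can decode the token and reject it if exact conformance matters.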

--lars
Received on Mon Apr 23 2007 - 15:07:22 UTC