[r6rs-discuss] Why lexers can be simpler when restricted to ASCII

From: Thomas Lord <lord>
Date: Mon, 23 Apr 2007 10:51:10 -0700

I have some suggestions, Alan, but of course they are fundamentally
greedy/self-interested ("Hey, give us some free labor!").

Flex, which you are using, is nicely polished but, ultimately, it
ain't rocket science. DFA minimization isn't that hard and it's
a fun programming puzzle.
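For a sense of scale: Moore-style partition refinement -- start with
accepting vs. non-accepting, then keep splitting any block whose
members disagree about where a symbol sends them -- fits in about a
dozen lines. A minimal sketch (Python rather than Scheme, and the
function name and encoding are mine, purely for illustration):

```python
def minimize(states, alphabet, delta, accepting):
    # delta[s][a] gives the successor of state s on symbol a.
    # Refinement only ever splits blocks, so an unchanged block count
    # means the partition is stable and we are done.
    partition = {s: (s in accepting) for s in states}
    while True:
        signature = {s: (partition[s],)
                        + tuple(partition[delta[s][a]] for a in alphabet)
                     for s in states}
        block = {sig: i
                 for i, sig in enumerate(sorted(set(signature.values())))}
        refined = {s: block[signature[s]] for s in states}
        if len(set(refined.values())) == len(set(partition.values())):
            return refined  # maps each state to its equivalence class
        partition = refined
```

Hopcroft's algorithm does the same job in O(n log n), but for
lexer-sized automata the naive version above is plenty.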

Tables for Unicode Scheme lexers under the draft aren't all that
impractical. The key insight is that such tables are mostly
homogeneous -- a sparse array data structure can serve well.
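To make that concrete, here is one way to exploit that homogeneity
(a sketch with my own names, not anything out of the draft): a
two-level table that splits a code point into a page number and an
offset, deduplicating identical 256-entry pages. Nearly all of the
~1.1 million code points fall into a handful of lexer character
classes, so very few distinct pages survive:

```python
PAGE = 256

def build_tables(classify, max_cp=0x110000):
    # classify : code point -> small integer character class
    # (supplied by the caller). Identical pages are shared, so the
    # all-"other" pages covering most of the code space collapse
    # into a single page.
    seen, index, pages = {}, [], []
    for base in range(0, max_cp, PAGE):
        page = tuple(classify(cp) for cp in range(base, base + PAGE))
        if page not in seen:
            seen[page] = len(pages)
            pages.append(page)
        index.append(seen[page])
    return index, pages

def char_class(tables, cp):
    index, pages = tables
    return pages[index[cp >> 8]][cp & 0xFF]
```

The index is 4352 entries plus a few shared 256-entry pages -- a far
cry from a flat 0x110000-entry array.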

You might consider writing a lexical analyzer generator from
scratch. It's not so crazy, really. If you make or take a flyweight
R5 implementation and build the analyzer in Scheme, you even
get a nice bootstrapping path.

I had some experience going down this path a ways back.
It's fun. My work then wasn't timely but the environment may
be more receptive these days. If you get into regular expression
hacking, it's just a hop-skip-and-a-jump to get to interesting
parsers.
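One pleasant route from regex hacking toward parsing is Brzozowski
derivatives: the derivative of a regular expression with respect to
a character is another expression matching the rest of the input,
and the idea extends to context-free parsing. A toy matcher to show
the shape of it (again Python, constructor names mine):

```python
EMPTY, EPS = ('empty',), ('eps',)

def lit(c):     return ('lit', c)
def seq(a, b):  return ('seq', a, b)
def alt(a, b):  return ('alt', a, b)
def star(a):    return ('star', a)

def nullable(r):
    # Does r match the empty string?
    t = r[0]
    if t in ('eps', 'star'): return True
    if t in ('empty', 'lit'): return False
    if t == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def deriv(r, c):
    # Expression matching { s : r matches c + s }.
    t = r[0]
    if t in ('empty', 'eps'): return EMPTY
    if t == 'lit': return EPS if r[1] == c else EMPTY
    if t == 'alt': return alt(deriv(r[1], c), deriv(r[2], c))
    if t == 'star': return seq(deriv(r[1], c), r)
    head = seq(deriv(r[1], c), r[2])  # seq case
    return alt(head, deriv(r[2], c)) if nullable(r[1]) else head

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)
```

Add hash-consing and a little algebraic simplification and the set
of derivatives stays finite, which is DFA construction almost for
free.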

It could be interesting to have a lex/parser-generator that can
not only generate C/C++/Java/... but can also drive Scheme
tools. I'm sure there are examples out there but I think there's
room for a tight, practical expression of such tools.

One application I found for such tools was in checking
mostly-statically-checkable invariants in a fork off of SCM.
For example, in SCM, certain invariants about the heap
data structure require that certain macros and functions are
invoked only from within explicitly delimited critical
sections. Turns out that, at the time, lots of programmer mistakes
violated those invariants. It also turned out that even a crude
C parser and crude flow analysis were "good enough" to
catch dozens of those errors automatically. Similar analysis
would have useful application in, for example, maintaining an
OS kernel like Linux.

Probably only take you a few days :-)

-t


Alan Watson wrote:
> In formal comment 231, I stated:
>
> "Many current Schemes have lexers written for ASCII (or Latin-1)
> character sets. Conversion of these lexers to the new standard would be
> easier if the report allowed inline hex escapes to appear anywhere in
> Scheme code."
>
> The editors replied:
>
> "It is unclear why converting the lexers would be significantly simpler
> through this change"
>
> Let me explain my original opinion. Many Schemes currently have lexers
> written in C using "char". These need converting to "long" to handle
> Unicode. Furthermore, table-driven approaches are practical for ASCII
> (128 values), but not practical for Unicode (1,114,112 values).
>
> In case that isn't clear enough: My Scheme uses flex for its lexer. I
> cannot see how to simply convert it to accept Unicode. I think I will
> have to dump flex and implement a new lexer by hand.
>
> Regards,
>
> Alan
>