[Python-Dev] Better text processing support in py2k?

Tim Peters tim_one@email.msn.com
Thu, 30 Dec 1999 22:35:18 -0500


[M.-A. Lemburg]
> What is QIO ?

See DejaNews (I don't save URLs).  "Quick" line-oriented text input adapted
from INN.  Someone rewrote that as a Python extension module.

>>     http://www.rebol.com/faq.html#11550948

> Looks nice indeed, but how does executable code fit into
> that definition ?

See the URL above I didn't save <wink>.  PARSE's "pattern" argument is a
block.  Blocks can be (& often are) nested.  Whether any given block is code
or data is all the same to REBOL, so passing nested code blocks in PARSE's
pattern argument is easy.  Because blocks are lexically scoped, assignments
(etc.) inside a block are (well, can be) visible to its context.  It's a
very Lispish approach.  REBOL is essentially Scheme under the covers, but
with syntax much more like Forth's (whitespace-separated strings of
arbitrary non-whitespace characters, with few pre-assigned meanings or
restrictions -- in fact, it's impossible for a compiler to determine where a
REBOL function call begins or ends; that can't be known until runtime).

> (mxTextTools allows you to write your own parsing elements
> in Python, BTW; it should be possible to use those mechanisms
> to achieve a similar integration.)

It can't capture the flavor -- although I don't know that it needs to
<wink>.  There's no distinction between "the pattern language" and "the
computational language" in REBOL or Icon, and it's hard to explain what a
maddening distinction that can be once you've lived without it.  mxTextTools
embedding would feel more like Icon, where the matching engine is fully
exposed to the programmer (REBOL hides it, allowing only "approved"
interactions).

>> OTOH, making lots of calls to analyze short strings is slow.

> That's why mxTextTools converts these search idioms into byte
> codes which it executes at C level. Some future version will
> even "precompile" the tuple input and then omit the type checks
> during the search...that should give another noticeable speedup.
> Note that recursion etc. can be done at C level too -- Python
> function calls are not needed.

That's also the curse of having distinct languages; e.g., Python already had
recursion, but you needed to reimplement it in a different way with
different syntax and different rules in your pattern language.  In Icon etc.,
there's no difference between a recursive pattern and a recursive function,
except in *what* it computes.  The machinery is all the same, and both more
powerful and easier to learn because of that.

> ...
> Just for kicks, here is the mysplit() function using mxTextTools:
>
> from mx.TextTools import *
>
> table = (
>     # Match all whitespace
>     (None,AllInSet,whitespace_set,+1),
>     # Match and tag all non-whitespace
>     ('text',AllInSet + AppendMatch,nonwhitespace_set,+1),
>     # Loop until EOF
>     (None,EOF,Here,-2),
>     )
>
> def mysplit(text):
>
>     return tag(text,table)[1]
>
> The timings:
>  mysplit: 5.84 sec.
>  string.split: 3.62 sec.
>
> Note that you can customize the above to split text at any
> character set you like, not just whitespace... without
> compiling or writing C code.

That's equally true of the example I posted <wink>.  Now what if I wanted to
stop splitting right after I find a keyword, recognized as such because it's
a key in some passed-in dictionary?  In my example, I make an obvious local
code change, from

    while s.notmany(white):  # consume non-whitespace
        result.append(s.get_match())
        s.many(white)

to

    while s.notmany(white):  # consume non-whitespace
        word = s.get_match()
        result.append(word)
        if dictionary.has_key(word):
            break
        s.many(white)

What does it do to your example?  Or what if the target string isn't "a
string" (the code I posted only assumes the "str" object responds to
indexing and slicing -- any buffer object is fine -- so my example doesn't
change at all)?  Or what if you need to pass the tokens on as they're found,
pipeline style?  Etc.  This is why I do complex string processing in Icon
<0.9 wink>.
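
For readers without the earlier post at hand, here is a minimal sketch of
what such a scanner could look like.  The Scanner class and the function
split_until_keyword are hypothetical names invented for illustration, not
the code originally posted; only the method names many, notmany and
get_match come from the snippets above.  It assumes a str-like target whose
items are one-character strings, and it uses a generator so tokens are
handed on as they're found, pipeline style.

```python
import string


class Scanner:
    """Hypothetical cursor over an indexable, sliceable string-like object."""

    def __init__(self, text):
        self.text = text
        self.pos = 0     # current cursor position
        self.start = 0   # where the most recent successful match began

    def _span(self, charset, want_member):
        # Advance while "is the character in charset?" equals want_member.
        i, n = self.pos, len(self.text)
        while i < n and (self.text[i] in charset) == want_member:
            i += 1
        if i == self.pos:
            return False             # matched nothing; cursor unchanged
        self.start, self.pos = self.pos, i
        return True

    def many(self, charset):
        """Consume one or more characters that are in charset."""
        return self._span(charset, True)

    def notmany(self, charset):
        """Consume one or more characters that are not in charset."""
        return self._span(charset, False)

    def get_match(self):
        """Return the slice consumed by the last successful match."""
        return self.text[self.start:self.pos]


def split_until_keyword(text, keywords, white=string.whitespace):
    """Yield whitespace-separated words, stopping right after a keyword."""
    s = Scanner(text)
    s.many(white)                    # skip leading whitespace
    while s.notmany(white):          # consume non-whitespace
        word = s.get_match()
        yield word                   # pass the token on as soon as it's found
        if word in keywords:         # works for a dict, set, or list
            break
        s.many(white)
```

Usage is ordinary iteration, e.g. list(split_until_keyword("a b stop c",
{"stop": 1})); the loop body is plain Python, so the "obvious local code
change" stays local.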

OTOH, at what it does well, mxTextTools runs quicker than Icon.  Its biggest
problem has always been that e.g. nobody knows what the hell

     (None,EOF,Here,-2),

*means* at first glance -- or third <wink>.
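
One way to decode it is a toy interpreter, sketched below.  This is not
mxTextTools; run_table, ALL_IN_SET and EOF are stand-in names made up here,
character sets are replaced by predicate functions, and the rules are an
assumed reading of the tag-table format: each entry is (tag, command,
argument, jump-on-failure), a successful entry falls through to the next
one, a failed entry jumps by its fourth element, and running off the end of
the table means the match is done.  Under that reading, the EOF entry says
"if we're at end of input, finish; otherwise jump back two entries and go
around again".

```python
# Stand-ins for the command names; this toy is not the real library.
ALL_IN_SET, EOF = "AllInSet", "EOF"


def run_table(text, table):
    """Run a table of (tag, command, argument, jump_on_failure) entries."""
    tokens = []
    pos = 0   # cursor into text
    i = 0     # index of the current table entry
    while 0 <= i < len(table):
        tag, command, arg, on_fail = table[i]
        if command == ALL_IN_SET:
            # Match one or more characters accepted by the predicate arg.
            end = pos
            while end < len(text) and arg(text[end]):
                end += 1
            matched = end > pos
            if matched:
                if tag is not None:
                    tokens.append(text[pos:end])   # "tag" the matched slice
                pos = end
        elif command == EOF:
            matched = pos >= len(text)             # arg is unused here
        else:
            raise ValueError("unknown command: %r" % (command,))
        # Success: fall through to the next entry.  Failure: relative jump.
        i += 1 if matched else on_fail
    return tokens


table = (
    # Match all whitespace; on failure just move on.
    (None, ALL_IN_SET, str.isspace, +1),
    # Match and collect all non-whitespace; on failure just move on.
    ("text", ALL_IN_SET, lambda c: not c.isspace(), +1),
    # At end of input we're done; otherwise loop back to the first entry.
    (None, EOF, None, -2),
)
```

With that table, run_table("  some  text ", table) walks whitespace, word,
"not EOF, jump back two" until the input runs out.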

an-extreme-on-the-transparency-vs-speed-curve-ly y'rs  - tim