[Python-Dev] Better text processing support in py2k?

M.-A. Lemburg mal@lemburg.com
Thu, 30 Dec 1999 12:52:50 +0100


Tim Peters wrote:
> 
> [Skip Montanaro, wants nicer text facilities]
> > While I don't want to turn Python into Perl, I would like to see
> > it do a better job of what most people probably use the language
> > for.  Here is a very short list of things I think need attention:
> >
> >     1. [*A* clear way to do memory- and time-efficient textfile
> >         input]
>
> ...
> 
> The Python QIO extension module is much easier to port but less compatible
> (it doesn't use stdio, so QIO-opened files don't play well with others) and
> slower (although that's likely repairable -- he's got two passes over the
> buffer where one hairier pass should suffice).

What is QIO?
 
> > Depending how far people want to go with things, adding some
> > language syntax to support regular expressions might be in order.
> > ...
> >     3. I've not yet used it, but I am told the pattern matching in
> >        Marc-Andre Lemburg's mxTextTools
> >       (http://starship.python.net/crew/lemburg/)
> >        is both powerful and efficient (though it certainly appears
> >        complex).  Perhaps it deserves consideration for
> >        incorporation into the core Python distribution.
> 
> It's not complex, it's complicated -- and *that's* what makes it un-Pythonic
> <wink>.  Tony Ibbs has written a friendly wrapper around mxTextTools that
> suppresses much of the non-essential complication.  OTOH, if you go into
> this with a regexp mindset, it will run much slower than a real regexp
> package, because the bulk of the latter is devoted to doing optimization;
> mxTextTools is WYSIWYG (it screams if you code to its strengths, but crawls
> if you e.g. try to implement naive backtracking).

All true. mxTextTools provides the tools, not the magic. But this
is also its strength: you can optimize the hell out of your particular
parsing requirement without having to think about how the RE optimizer
works.
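
For contrast, here is an untested sketch (names are mine) of what
the "let the RE optimizer do the work" version of the whitespace
split looks like -- see the mysplit() example below for the
mxTextTools side of the comparison:

import re

words = re.compile(r'\S+')   # one or more non-whitespace characters

def re_split(text):
    # the RE engine picks the scanning strategy for you; with
    # mxTextTools you spell that strategy out in the tag table
    return words.findall(text)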

> You should go to the REBOL site and look at the description of REBOL's PARSE
> verb in the FAQ ... mumble, mumble ... at
> 
>     http://www.rebol.com/faq.html#11550948
> 
> Here's an example pulled from that page (this is a REBOL code fragment):
> 
>     digit: charset "0123456789"
>     expr: [term ["+" | "-"] expr | term]
>     term: [factor ["*" | "/"] term | factor]
>     factor: [primary "**" factor | primary]
>     primary: [value | "(" expr ")"]
>     value: [digit value | digit]
> 
>     parse "1 + 2 ** 9" expr
> 
> There hasn't been a pattern scheme this clean, convenient or powerful since
> SNOBOL4.  It exploits REBOL's Forth-like (lack of!) syntax, and
> Smalltalk-like penchant for passing around thunks (anonymous closures --
> "[...]" in REBOL builds a lexically-scoped entity called "a block", which
> can be treated as code (executed) or data (manipulated like a Python list)
> at will).

Looks nice indeed, but how does executable code fit into
that definition? (mxTextTools allows you to write your own
parsing elements in Python, BTW; it should be possible to
use those mechanisms to achieve a similar integration.)
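
To make that concrete, here is an untested sketch of a Python-level
parsing element hooked in via the Call command. I'm quoting the
calling convention from memory -- fct(text, pos, stop) returning the
position where matching stopped, with an unchanged position meaning
"no match" -- so check the mxTextTools docs before relying on the
details:

from mx.TextTools import *

def match_even_digit(text, pos, stop):
    # user-defined parsing element: match one even digit
    if pos < stop and text[pos] in '02468':
        return pos + 1
    return pos    # unchanged position = no match (assumed convention)

sign_table = (
    # optional sign
    (None, IsIn, '+-', +1),
    # hand matching over to the Python function above
    ('even', Call, match_even_digit),
    )

ok, taglist, nextpos = tag('+4', sign_table)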
 
> ...
>
> BTW, the mxTextTools engine could be used to get blazing implementations of
> the primary Searcher methods (it excels at simple analysis).  OTOH, making
> lots of calls to analyze short strings is slow.

That's why mxTextTools converts these search idioms into byte codes
which it executes at C level. Some future version will even "precompile"
the tuple input and then omit the type checks during the search...
that should give another noticeable speedup. Note that recursion
etc. can be done at C level too -- Python function calls are not
needed.
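
As an untested sketch of what that C-level recursion looks like
(assuming I remember the SubTable command, the ThisTable argument
and the MatchOk jump target correctly), here is the REBOL rule
"value: [digit value | digit]" from above as a self-referencing
tag table -- the engine re-enters the same table without a single
Python function call:

from mx.TextTools import *

value = (
    # a digit is required; if it is missing, the table match fails
    (None, IsIn, '0123456789'),
    # optionally recurse into this same table for more digits;
    # if the recursion fails, the table still matches (MatchOk)
    (None, SubTable, ThisTable, MatchOk),
    )

ok, taglist, nextpos = tag('12345', value)
# ok should be 1 with nextpos at 5, i.e. all digits consumed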

> The only clean solutions to
> that are Perl's and Icon's (build everything into one language so the
> compiler can optimize stuff away), and REBOL's (make no distinction between
> code and data, so that code can be analyzed & optimized at runtime -- and
> build the entire implementation around making closures and calls
> supernaturally fast).

Just for kicks, here is the mysplit() function using mxTextTools:

from mx.TextTools import *

table = (
    # Match all whitespace
    (None, AllInSet, whitespace_set, +1),
    # Match and tag all non-whitespace
    ('text', AllInSet + AppendMatch, nonwhitespace_set, +1),
    # Loop until EOF
    (None, EOF, Here, -2),
    )

def mysplit(text):
    return tag(text, table)[1]
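
(Quick usage note: tag() returns a (success, taglist, next_position)
tuple, and thanks to the AppendMatch flag the taglist is simply the
list of matched words, so mysplit('Hello   Python world') gives
['Hello', 'Python', 'world'].)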

The timings:
 mysplit: 5.84 sec.
 string.split: 3.62 sec.

Note that you can customize the above to split text at any
character set you like, not just whitespace... without compiling
or writing any C code. The function mx.TextTools.setsplit()
provides the same functionality as a pure C function.
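
For example, here is an untested variant that splits on commas,
using AllIn/AllNotIn with a plain character string instead of the
precomputed sets (like the whitespace version, runs of delimiters
are collapsed, so no empty fields are produced):

from mx.TextTools import *

comma_table = (
    # Skip any commas (the delimiters)
    (None, AllIn, ',', +1),
    # Match and append everything up to the next comma
    ('field', AllNotIn + AppendMatch, ',', +1),
    # Loop until EOF
    (None, EOF, Here, -2),
    )

def commasplit(text):
    return tag(text, comma_table)[1]

For the plain whitespace case, setsplit(text, whitespace_set) should
do the same job directly in C (check the docs for the exact
signature).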

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                            Get ready to party !
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/