[Python-Dev] Better text processing support in py2k?

Tim Peters tim_one@email.msn.com
Thu, 30 Dec 1999 01:08:58 -0500


[Skip Montanaro, wants nicer text facilities]
> ...
> I rather suspect that more people use Python for some sort of
> text processing than any other single application domain.

Hmm.  You're probably right, but I'm an exception.

> Python should be good at it.

And I guess I'm an exception mostly *because* Perl is better at easy text
crunching and Icon is better at hard text-crunching -- that is, I use the
right tool for the job <wink>.

> While I don't want to turn Python into Perl, I would like to see
> it do a better job of what most people probably use the language
> for.  Here is a very short list of things I think need attention:
>
>     1. [*A* clear way to do memory- and time-efficient textfile
>         input]

I agree, but I'm unsure how to fix it.  The best way to write this now is

    # f is some open file object.
    while 1:
        lines = f.readlines(BUFSIZE)
        if not lines:
            break
        for line in lines:
            process(line)

and it's not something anyone figures out on their own -- or enjoys typing
or explaining afterwards.

Perl gets its line-at-a-time speed by peeking and poking C FILE structs
directly in compiler- and platform-specific ways -- ways that vendors
*should* have done in their own fgets implementations, but almost never do.
I have no idea whether it works well with Perl's nascent notions of
threading, but in the absence of that "the system" doesn't know Perl is
cheating (i.e., as far as libc+friends are concerned, Perl *is* reading one
line at a time -- even mixing in C-level ungetc calls works (well, sometimes
<0.1 wink -- they don't always peek and poke enough fields>)).

The Python QIO extension module is much easier to port but less compatible
(it doesn't use stdio, so QIO-opened files don't play well with others) and
slower (although that's likely repairable -- he's got two passes over the
buffer where one hairier pass should suffice).

>     2. The re module needs to be sped up, if not to catch up with
>        Perl, then to catch up with the deprecated regex module.

The irony here is that the re engine is very often unboundedly faster than
the regex engine -- provided you're chewing over large strings.  Some tests
/F ran showed that the length-independent *overhead* of invoking re is about
10x higher than for regex.  Presumably the bulk of that is due to re.py,
i.e., you get to the re engine by going through Python layers on your way
in and out, while regex was pure C.
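
The effect is easy to see with a crude timing loop over a short string,
where per-call overhead swamps the matching itself (a sketch; the pattern
and counts are arbitrary):

    import time, re
    p = re.compile("spam")
    n = 10000
    t = time.clock()
    for i in xrange(n):
        p.search("spam & eggs")   # short subject: overhead dominates
    print "seconds per call:", (time.clock() - t) / n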

In any case, /F is working on a new engine (for Unicode), and I believe he
has this all well in hand.

> Depending how far people want to go with things, adding some
> language syntax to support regular expressions might be in order.
> ...
>     3. I've not yet used it, but I am told the pattern matching in
>        Marc-Andre Lemburg's mxTextTools
>       (http://starship.python.net/crew/lemburg/)
>        is both powerful and efficient (though it certainly appears
>        complex).  Perhaps it deserves consideration for
>        incorporation into the core Python distribution.

It's not complex, it's complicated -- and *that's* what makes it un-Pythonic
<wink>.  Tony Ibbs has written a friendly wrapper around mxTextTools that
suppresses much of the non-essential complication.  OTOH, if you go into
this with a regexp mindset, it will run much slower than a real regexp
package, because the bulk of the latter is devoted to doing optimization;
mxTextTools is WYSIWYG (it screams if you code to its strengths, but crawls
if you e.g. try to implement naive backtracking).

You should go to the REBOL site and look at the description of REBOL's PARSE
verb in the FAQ ... mumble, mumble ... at

    http://www.rebol.com/faq.html#11550948

Here's an example pulled from that page (this is a REBOL code fragment):

    digit: charset "0123456789"
    expr: [term ["+" | "-"] expr | term]
    term: [factor ["*" | "/"] term | factor]
    factor: [primary "**" factor | primary]
    primary: [value | "(" expr ")"]
    value: [digit value | digit]

    parse "1 + 2 ** 9" expr

There hasn't been a pattern scheme this clean, convenient or powerful since
SNOBOL4.  It exploits REBOL's Forth-like (lack of!) syntax, and
Smalltalk-like penchant for passing around thunks (anonymous closures --
"[...]" in REBOL builds a lexically-scoped entity called "a block", which
can be treated as code (executed) or data (manipulated like a Python list)
at will).

Now the example doesn't show this, but you can freely mix computations into
the middle of the patterns; only *some* of the words in the blocks have
special meaning to PARSE.  The fragment above is already way beyond what can
be accomplished with regexps, but that's just the start of it.  Perl too is
slamming in more & more ways to get user code to interact with its regexp
engine.

So REBOL has a *very* nice approach to this; I believe it's unreasonably
clumsy to mimic in Python primarily because of forward references (note e.g.
that the block attached to "expr" above refers to "term" before the latter
has been bound -- but the stuff inside [...] is just a closure so that
doesn't matter -- it only matters that term gets bound before expr is
*executed*).  I hit a similar snag years ago when trying to mimic SNOBOL4's
approach in Python.
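
The only cheap way to delay name lookup in Python is to wrap things in
functions, which is exactly the clumsiness at issue.  A made-up fragment
(the names echo the REBOL example):

    # expr refers to term before term is bound; hiding the reference
    # inside a lambda postpones the lookup until call time.
    expr = lambda s: term(s)
    term = lambda s: len(s)
    print expr("1 + 2 ** 9")   # prints 10 -- term got bound in time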

Perl's endless abuse of regexps is making that language more absurd by the
month.

The other major approach to mixing patterns with computation is due to Icon,
another language where a regexp mindset is fatal.  On a whim, I whipped up
the attached, which illustrates a bit of the Icon approach in Pythonic terms
(but without language support for generators, the *heart* of it can't really
be captured).  Here's an example of how this could be used to implement (the
simplest form of) string.split:

def mysplit(str):
    s = Searcher(str)
    white = CharSet(" \t\n")
    result = []
    s.many(white)            # consume initial whitespace
    while s.notmany(white):  # consume non-whitespace
        result.append(s.get_match())
        s.many(white)
    return result

>>> mysplit("   \t Hey,   that's\tpretty\n\n neat!  ")
['Hey,', "that's", 'pretty', 'neat!']
>>>

The primary thing to note is that there's no seam between analyzing the
string and doing computation on the partial results -- "the program is the
pattern".  This is what Icon does to perfection, Perl is moving toward, and
REBOL is arriving at from a different direction.  It's The Future <0.9
wink>.

Without generators it's difficult to work backtracking into the Searcher
class, but, as above, in my experience the backtracking feature of regexps
is rarely *needed*!  For example, at various points "split" wants to suck up
all the whitespace characters, and that's *it* -- the backtracking
possibility in the regexp \s+ is often a bug just waiting for unexpected
*context* to trigger it.  A hairy regexp is pure hell; but the things
simpler regexps can do don't require all that funky regexp machinery.
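
Here's a tiny made-up instance of the surprise:

    import re
    m = re.match(r"\s+(\s)", "   x")
    # Read \s+ as "all the whitespace"?  Not quite: it cheerfully gives
    # one blank back so the trailing (\s) can succeed.
    print len(m.group(1))   # prints 1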

BTW, the mxTextTools engine could be used to get blazing implementations of
the primary Searcher methods (it excels at simple analysis).  OTOH, making
lots of calls to analyze short strings is slow.  The only clean solutions to
that are Perl's and Icon's (build everything into one language so the
compiler can optimize stuff away), and REBOL's (make no distinction between
code and data, so that code can be analyzed & optimized at runtime -- and
build the entire implementation around making closures and calls
supernaturally fast).

the-less-you-use-regexps-the-less-you-miss-'em<wink>-ly y'rs  - tim

class CharSet:
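    # Tiny character-class helper: calling an instance asks whether a
    # single character is in the set (a dict lookup).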
    def __init__(self, seq):
        self.seq = seq
        d = {}
        for ch in seq:
            d[ch] = 1
        self.haskey = d.has_key

    def __call__(self, ch):
        return self.haskey(ch)

    def __add__(self, other):
        if isinstance(other, CharSet):
            other = other.seq
        return CharSet(self.seq + other)

def _normalize_index(i, n):
    assert n >= 0
    if i >= 0:
        return min(i, n)
    elif n == 0:
        return 0
    # want smallest q s.t. i + q*n >= 0
    # <->  q*n >= -i
    # <->  q >= -i/n
    # so q = ceiling(-i/n) = -floor(i/n)
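    # e.g. _normalize_index(-7, 5) == -7 + 2*5 == 3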
    return i - (i/n)*n

class Searcher:
    def __init__(self, str, lo=0, hi=None):
        """Create object to search in str[lo:hi].

        lo defaults to 0.
        hi defaults to len(str).
        len(str) is repeatedly added to negative lo or hi until
        reaching a number >= 0.
        If lo > hi, a uselessly empty slice will be searched.
        The search cursor is initialized to lo.
        """

        self.s = str
        self.lo = _normalize_index(lo, len(str))
        if hi is None:
            self.hi = len(str)
        else:
            self.hi = _normalize_index(hi, len(str))
        if self.lo > self.hi:
            self.hi = self.lo
        self.i = self.lo
        self.lastmatch = None, None

    def any(self, charset, consume=1):
        """Try to match single character in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """

        i = self.i
        if i < self.hi and charset(self.s[i]):
            if consume:
                self.__consume(i+1)
            return 1
        return 0

    def notany(self, charset, consume=1):
        """Try to match single character not in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """

        i = self.i
        if i < self.hi and not charset(self.s[i]):
            if consume:
                self.__consume(i+1)
            return 1
        return 0

    def many(self, charset, consume=1):
        """Try to match one or more characters in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """

        i, n, s = self.i, self.hi, self.s
        j = i
        while j < n and charset(s[j]):
            j = j+1
        if i < j:
            if consume:
                self.__consume(j)
            return 1
        return 0

    def notmany(self, charset, consume=1):
        """Try to match one or more characters not in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """

        i, n, s = self.i, self.hi, self.s
        j = i
        while j < n and not charset(s[j]):
            j = j+1
        if i < j:
            if consume:
                self.__consume(j)
            return 1
        return 0

    def match(self, str, consume=1):
        """Try to match string "str".

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """

        i = self.i
        j = i + len(str)
        # Respect the slice bound: a match must not extend past hi.
        if j <= self.hi and self.s[i:j] == str:
            if consume:
                self.__consume(j)
            return 1
        return 0

    def get_str(self):
        """Return subject string."""
        return self.s

    def get_lo(self):
        """Return low slice bound."""
        return self.lo

    def get_hi(self):
        """Return high slice bound."""
        return self.hi

    def get_pos(self):
        """Return current value of search cursor."""
        return self.i

    def get_match_indices(self):
        """Return slice indices of last "consumed" match."""
        return self.lastmatch

    def get_match(self):
        """Return last "consumed" matching substring."""
        i, j = self.lastmatch
        if i is None:
            raise ValueError("no match to return!")
        return self.s[i:j]

    def set_pos(self, pos, consume=1):
        """Set search cursor to new value.  No return value.

        If optional arg "consume" is true, the last match is set to
        the slice between pos and the current cursor position.
        """

        p = _normalize_index(pos, len(self.s))
        if not self.lo <= p <= self.hi:
            raise ValueError("pos out of bounds: " + `pos`)
        if consume:
            self.__consume(p)
        else:
            self.i = p

    def move_pos(self, incr, consume=1):
        """Move the cursor by incr characters.  No return value.

        If the new value is outside the slice bounds, it's clipped.
        If optional arg "consume" is true, the last match is set to
        the slice between the old and new cursor positions.
        """

        newi = self.i + incr
        if newi < self.lo:
            newi = self.lo
        elif newi > self.hi:
            newi = self.hi
        if consume:
            self.__consume(newi)
        else:
            self.i = newi

    def __consume(self, newi):
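        # Record the slice between the old and new cursor positions as
        # the last match, then move the cursor.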
        i, j = self.i, newi
        if i > j:
            i, j = j, i
        self.lastmatch = i, j
        self.i = newi
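
A quick sanity check of the pieces mysplit doesn't exercise (throwaway
code):

    s = Searcher("width=80;")
    digits = CharSet("0123456789")
    assert s.match("width=")
    assert s.many(digits)
    assert s.get_match() == "80"
    assert s.any(CharSet(";"))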