Regular expressions in Python

Alex Martelli aleaxit at yahoo.com
Sun Sep 3 10:54:57 EDT 2000


<johnvert at my-deja.com> wrote in message news:8otjf2$5d5$1 at nnrp1.deja.com...
> Hello,
>
> I have a few questions regarding the usage of regular expressions in
> Python.
>
> 1)  In Perl, I can do something like
>
>     if (/START(.+?)END/) {
>       use $1 here (value caught in (.+?))
>     }
>
>     what is the equivalent of Perl's $1, $2, ... in Python.

The re.search function (or the .search method of a compiled
re object) returns a match-object, or None if nothing matched.
The match-object has a group() method that returns one or
more matched groups.  So, your specific example:

if (/START(.+?)END/) {
    &someuse($1);
}

becomes (assuming the string to be examined is in a
variable named line):

mo = re.search(r'START(.+?)END', line)
if mo:
    someuse(mo.group(1))
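
A slightly fuller sketch of the same idea, for illustration only: the
sample strings and the two-group pattern are made up here, not taken
from the question.  group(0) is the whole match, group(1), group(2), ...
play the role of Perl's $1, $2, ..., and groups() returns them all as
a tuple.

```python
import re

# Hypothetical sample input, invented for this illustration.
line = "xxSTARThelloENDyy START again END"

# Non-greedy (.+?) stops at the first END, as in the Perl original.
mo = re.search(r'START(.+?)END', line)
if mo:
    first = mo.group(1)      # like Perl's $1
    whole = mo.group(0)      # the entire matched text, like Perl's $&

# A compiled pattern object offers the same search() as a method.
pair = re.compile(r'(\w+)=(\w+)')
mo2 = pair.search("key=value")
both = mo2.groups()          # all groups at once, like ($1, $2)
```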


> 2)  This question is not directly related to regular expressions, but
>     more to parsing text in Python in general:
>
>     I want to capture stuff between START and END, like in the above
>     regular expression, but START, the stuff in the middle, and END are
>     not necessarily on the same line.  The only way I can think of is to
>     read the whole file into memory as a string, and operate on that
>     string, or read it with readlines() and join() those to a string.
>     Both of these approaches would be slow because the file would be
>     read in one slurp.  Is there a way to handle

Why do you think reading the file in one slurp is slow?  It's fastest,
if the file does fit into memory (a few tens of megabytes worth of
text).  It's also what one would normally do in Perl.

Reading with readlines() then doing a join() is needlessly slow; just
read the whole file with read(), that's what it's for.  And of course,
do supply the DOTALL flag to re.search, so that . also matches
newlines:-).
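
To see what the flag changes, here is a small sketch on an in-memory
string (the sample text is invented).  By default . does not match a
newline, so a block that spans lines is only found with re.DOTALL;
re.MULTILINE, by contrast, only affects where ^ and $ match.

```python
import re

# Invented sample: the START ... END block spans two lines.
text = "junk START first\nsecond END more junk\n"

# Default: '.' stops at '\n', so nothing is found.
no_flag = re.search(r'START(.+?)END', text)
# MULTILINE only changes ^ and $, so it does not help here either.
multiline = re.search(r'START(.+?)END', text, re.MULTILINE)
# DOTALL lets '.' match '\n' too, so the block is captured.
dotall = re.search(r'START(.+?)END', text, re.DOTALL)

captured = dotall.group(1)
```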

> this `multiple line' parsing in a way that I can read the file
> line by line, as in:
>
>     while 1:
>       line = file.readline()
>       # parse

Sure, but you have to keep track of the parsing state yourself -- i.e.
implement a finite state machine to remind you of whether you've
already seen START, etc.  Not worth it, unless you do fear files of
hundreds of megabytes or more will need to be parsed.

Assuming what you want to do is process each block of text that
is between a START and the immediately following END, this
would work:

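    import re

    NEUTRAL, LOOKING = 0, 1     # outside a block / inside a block
    state = NEUTRAL
    while 1:
        line = file.readline()
        if not line: break
        if state == NEUTRAL:
            # search, not match: START may appear anywhere in the line
            most = re.search('START', line)
            if not most: continue
            pieces = []
            lastpiece = line[most.end():]
            state = LOOKING
        else:
            # already inside a block: the whole line is a candidate
            lastpiece = line
        moen = re.search('END', lastpiece)
        if not moen:
            pieces.append(lastpiece)
            continue
        pieces.append(lastpiece[:moen.start()])
        someuse(''.join(pieces))
        state = NEUTRAL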

This is _probably_ correct or close to it -- but why bother to
test/debug/correct (much less to write it in the first place)
when you can have an *obviously correct* implementation...?
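
For comparison, a minimal sketch of what such an obviously correct
version might look like, assuming the file fits in memory.  The
function name blocks_between and the sample text are made up for this
illustration; someuse from the examples above would be called on each
returned block.

```python
import re

def blocks_between(text, start='START', end='END'):
    """Return every stretch of text between start and the next end."""
    # re.escape keeps the markers literal; DOTALL lets '.' cross newlines.
    pattern = re.escape(start) + r'(.+?)' + re.escape(end)
    return re.findall(pattern, text, re.DOTALL)

# With a real file one would pass open(filename).read() instead.
sample = "a START one\ntwo END b START three END c\n"
found = blocks_between(sample)
```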

Remember, there are two ways to write a program: make it so
simple that there are obviously no bugs, or so complex that
there are no OBVIOUS bugs.  Always do the simplest thing
that could possibly work...


Alex





