Regular expressions in Python
Alex Martelli
aleaxit at yahoo.com
Sun Sep 3 10:54:57 EDT 2000
<johnvert at my-deja.com> wrote in message news:8otjf2$5d5$1 at nnrp1.deja.com...
> Hello,
>
> I have a few questions regarding the usage of regular expressions in
> Python.
>
> 1) In Perl, I can do something like
>
> if (/START(.+?)END/) {
> use $1 here (value caught in (.+?))
> }
>
> what is the equivalent of Perl's $1, $2, ... in Python.
The re.search function (or the search method of a compiled
re object) returns a match object, or None if nothing matched.
The match object has a group() method that returns one or more
of the matched groups. So, your specific example:
    if (/START(.+?)END/) {
        &someuse($1);
    }
becomes (assuming the string to be examined is in a
variable named line):
    import re
    mo = re.search(r'START(.+?)END', line)
    if mo:
        someuse(mo.group(1))
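To make the group() behaviour concrete, here is a small self-contained sketch. The sample strings are made up for illustration; only re.search and group() come from the library.

```python
import re

line = "xxSTARThelloENDyySTARTworldENDzz"

# Non-greedy (.+?): stop at the first END that follows START.
mo = re.search(r'START(.+?)END', line)
if mo:
    first = mo.group(1)

# Greedy (.+): run on to the last END in the string.
greedy = re.search(r'START(.+)END', line).group(1)

# Asking for several groups at once returns a tuple,
# roughly what Perl's ($1, $2) would give you.
mo2 = re.search(r'(\w+)=(\w+)', 'key=value')
pair = mo2.group(1, 2)
```

Note that group(0) is the whole match, so numbering of the parenthesized groups starts at 1, just as with $1 in Perl.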
> 2) This question is not directly related to regular expressions,
> but more to parsing text in Python in general:
>
> I want to capture stuff between START and END, like in the
> above regular expression, but START, the stuff in the middle,
> and END are not necessarily on the same line. The only way I can
> think of is to read the whole file into memory as a string, and
> operate on that string, or read it with readlines() and join()
> those to a string. Both of these approaches would be slow because
> the file would be read in one slurp. Is there a way to handle
Why do you think reading the file in one slurp is slow? It's fastest,
if the file does fit into memory (a few tens of megabytes worth of
text). It's also what one would normally do in Perl.
Reading with readlines() then doing a join() is needlessly slow; just
read the whole file with read(), that's what it's for. And of course,
do supply the DOTALL flag to re.search, so that '.' also matches
newlines (MULTILINE only changes what ^ and $ match) :-).
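A quick sketch of how the flags behave on text that spans lines. The sample text is invented; in a real program it would come from something like open(filename).read(), where filename is whatever file you are parsing. It is re.DOTALL that lets '.' cross a newline, while re.MULTILINE only affects ^ and $.

```python
import re

# Stand-in for the result of reading a whole file with read().
text = "junk\nSTART first line\nsecond line END\nmore junk\n"

pattern = r'START(.+?)END'

# No flags: '.' stops at a newline, so the two-line block is not found.
no_flag = re.search(pattern, text)

# MULTILINE changes only ^ and $, so it does not help here either.
multi = re.search(pattern, text, re.MULTILINE)

# DOTALL: '.' matches newlines too, so the block is captured whole.
dotall = re.search(pattern, text, re.DOTALL)
captured = dotall.group(1)
```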
> this `multiple line' parsing in a way that I can read the file
> line by line, as in:
>
> while 1:
> line = file.readline()
> # parse
Sure, but you have to keep track of the parsing state yourself -- i.e.
implement a finite state machine to remind you of whether you've
already seen START, etc. Not worth it, unless you do fear files of
hundreds of megabytes or more will need to be parsed.
Assuming what you want to do is process each block of text that
is between a START and the immediately following END, this
would work:
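    import re

    NEUTRAL, LOOKING = 0, 1     # the two parsing states
    state = NEUTRAL
    while 1:
        line = file.readline()
        if not line: break
        if state == NEUTRAL:
            # search, not match: START need not begin the line
            most = re.search('START', line)
            if not most: continue
            pieces = []
            lastpiece = line[most.end():]   # text following START
            state = LOOKING
        else:
            lastpiece = line                # still inside a block
        moen = re.search('END', lastpiece)
        if not moen:
            pieces.append(lastpiece)
            continue
        pieces.append(lastpiece[:moen.start()])
        someuse(''.join(pieces))            # keep the newlines as read
        state = NEUTRAL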
This is _probably_ correct or close to it -- but why bother to
test/debug/correct (much less to write it in the first place)
when you can have an *obviously correct* implementation...?
Remember, there are two ways to write a program: make it so
simple that there are obviously no bugs, or so complex that
there are no OBVIOUS bugs. Always do the simplest thing
that could possibly work...
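For comparison, a minimal sketch of the simple version, which reads everything at once and lets the re module find each block. The function name blocks and the sample data are made up for illustration, and io.StringIO merely stands in for an open file.

```python
import io
import re

def blocks(f):
    """Return the text of every START...END block in the open file f."""
    # DOTALL lets '.' cross newlines; +? stops at the nearest END.
    return re.findall(r'START(.+?)END', f.read(), re.DOTALL)

# io.StringIO behaves like a file opened for reading.
sample = io.StringIO("a START one\ntwo END b\nSTART three END\n")
found = blocks(sample)
```

Each element of found can then be handed to someuse() in a plain for loop, with no state to track by hand.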
Alex