[Tutor] RE troubles

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Sat Aug 14 01:15:44 CEST 2004



On Fri, 13 Aug 2004, Øyvind wrote:

> I am trying to make sense of the RE module and get the correct output. I
> have a document where there are five instances of a word or a sentence.
> How many characters or what kind of characters are unknown. I have
> downloaded Kodos, but can't get any wiser.


Hello!


"Kodos"?  Oh, you mean the Kodos regular expression debugger.

    http://kodos.sourceforge.net/

Cool; I didn't know about this one.



> I know that the word is following "target="_top">" and is before "</a><a
> href=javascript". So the document will contain five instances of:
>
> target="_top"> word1 </a><a href=javascript
> target="_top">sentence 2</a><a href=javascript
> and so forth....
>
> How do I get them out?


Can you show us what you have tried so far?



You can probably get what you want by doing something like this:

###
>>> import re
>>> regex = re.compile(r"""\|
...                        (.*?)
...                        \|""", re.VERBOSE)
>>>
###



The above is a regular expression that will hunt for things between pipe
symbols.  For example, we can use findall():

###
>>> regex.findall(" |this is| a test of the |emergency| |broadcast system|")
['this is', 'emergency', 'broadcast system']
###

and get all the "piped" words in a snap.


The slightly tricky part of the pattern above is the use of a wildcard (.*)
to grab the content in between.  We have to make sure that the match is
"nongreedy" by adding a question mark to the wildcard: (.*?)



What does it mean for a match to be greedy?  Let's see what happens if we
leave the question mark off:

###
>>> regex = re.compile(r"""\|
...                        (.*)
...                        \|""", re.VERBOSE)
>>>
>>> regex.findall(" |this is| a test of the |emergency| |broadcast system|")
['this is| a test of the |emergency| |broadcast system']
###

When we search for all occurrences of things between pipes, this time we
get just one element.  The regular expression engine is giving us an
answer that's technically true, since the thing it found was surrounded by
pipes.  But the wildcard grabbed as much text as it could instead of
stopping at the first closing pipe, which is why we call it "greedy".
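
The same nongreedy idea might carry over to the markup in your question.
Here is a rough sketch, assuming the five instances look like the two
sample lines you quoted.  The sample string and the names "sample",
"link_regex" and "found" are made up for illustration; I haven't seen the
real document, so the markers may need adjusting:

```python
import re

# Made-up sample modeled on the two lines quoted in the question,
# run together on one line so that greediness actually matters.
sample = ('target="_top"> word1 </a><a href=javascript '
          'target="_top">sentence 2</a><a href=javascript')

# Literal start marker, nongreedy capture, literal end marker.
# re.escape() protects any regex metacharacters inside the markers.
link_regex = re.compile(re.escape('target="_top">')
                        + r'(.*?)'
                        + re.escape('</a><a href=javascript'))

# strip() removes the padding spaces around " word1 ".
found = [text.strip() for text in link_regex.findall(sample)]
print(found)
```

With the sample above this should print ['word1', 'sentence 2'].  Note
that "." does not match a newline by default, so if the text between the
markers can span several lines, the pattern would also need the re.DOTALL
flag.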


You may find the Python Regex HOWTO useful, as it explains these concepts:

    http://www.amk.ca/python/howto/regex/



Hope this helps!



