simple regular expression problem

Jason Drew jasondrew72 at gmail.com
Mon Sep 17 09:22:08 EDT 2007


You just need a one-character addition to your regex:

regex = re.compile(r'<organisatie.*?</organisatie>', re.S)

Note, there is now a question mark (?) after the .*

By default, regular expressions are "greedy" and will grab as much
text as possible when making a match. So your original expression was
grabbing everything between the first opening tag and the last closing
tag. The question mark says, don't be greedy, and you get the
behaviour you need.

This is covered in the documentation for the re module.
http://docs.python.org/lib/module-re.html

Jason

On Sep 17, 9:00 am, duikboot <dijkstra.ar... at gmail.com> wrote:
> Hello,
>
> I am trying to extract a list of strings from a text. I am looking it
> for hours now, googling didn't help either.
> Could you please help me?
>
> >>>s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
> >>> regex = re.compile(r'<organisatie.*</organisatie>', re.S)
> >>> L = regex.findall(s)
> >>> print L
>
> ['organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>
> \n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie']
>
> I expected:
> [('organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>
> \n<organisatie>), (<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</
> organisatie')]
>
> I must be missing something very obvious.
>
> Greetings Arjen





More information about the Python-list mailing list