simple regular expression problem

Mon Sep 17 09:22:08 EDT 2007

You just need a one-character addition to your regex:

regex = re.compile(r'<organisatie.*?</organisatie>', re.S)

Note, there is now a question mark (?) after the .*

By default, quantifiers such as * and + are "greedy" and will grab as
much text as possible while still making a match. So your original
expression was grabbing everything between the first opening tag and
the last closing tag. The question mark after the .* says, don't be
greedy: match as little as possible, so each match stops at the
nearest closing tag, and you get the behaviour you need.
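
To see the difference side by side, here is a small sketch using the
sample string from your message. It assumes Python 3 (print as a
function), and the names greedy, lazy, greedy_matches and lazy_matches
are only illustrative, not anything from the re module:

```python
import re

# Sample text from the original question: two <organisatie> blocks.
s = (" \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>"
     "\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>")

# Greedy: .* runs on to the LAST closing tag, so both blocks
# come back glued together as a single match.
greedy = re.compile(r'<organisatie.*</organisatie>', re.S)

# Non-greedy: .*? stops at the FIRST closing tag it can reach,
# so each block becomes its own match.
lazy = re.compile(r'<organisatie.*?</organisatie>', re.S)

greedy_matches = greedy.findall(s)
lazy_matches = lazy.findall(s)

print(len(greedy_matches))  # 1
print(len(lazy_matches))    # 2
for match in lazy_matches:
    print(repr(match))
```

The re.S flag matters in both patterns: it lets the dot match newlines,
which the sample needs because each block spans several lines.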

This is covered in the documentation for the re module.
http://docs.python.org/lib/module-re.html

Jason

On Sep 17, 9:00 am, duikboot <dijkstra.ar... at gmail.com> wrote:
> Hello,
>
> I am trying to extract a list of strings from a text. I have been
> looking at it for hours now, and googling didn't help either.
> Could you please help me?
>
> >>> import re
> >>> s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
> >>> regex = re.compile(r'<organisatie.*</organisatie>', re.S)
> >>> L = regex.findall(s)
> >>> print(L)
>
> ['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']
>
> I expected:
> [('organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>
> \n<organisatie>), (<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</
> organisatie')]
>
> I must be missing something very obvious.
>
> Greetings Arjen