simple regular expression problem
Jason Drew
jasondrew72 at gmail.com
Mon Sep 17 09:22:08 EDT 2007
You just need a one-character addition to your regex:
regex = re.compile(r'<organisatie.*?</organisatie>', re.S)
Note, there is now a question mark (?) after the .*
By default, regular expression quantifiers such as * are "greedy" and
will grab as much text as possible when making a match. So your
original expression was grabbing everything between the first opening
tag and the last closing tag. The question mark after the quantifier
says, don't be greedy: match as little text as possible. That gives
you the behaviour you need.
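
To see the difference side by side, here is a minimal sketch that runs
both patterns against your sample string. The names greedy, lazy,
greedy_matches and lazy_matches are just labels I made up for
illustration; the string and the two patterns are the ones from this
thread.

```python
import re

# Sample text from the original question.
s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""

# Greedy: .* runs on to the LAST closing tag it can reach.
greedy = re.compile(r'<organisatie.*</organisatie>', re.S)

# Non-greedy: .*? stops at the FIRST closing tag it can reach.
lazy = re.compile(r'<organisatie.*?</organisatie>', re.S)

greedy_matches = greedy.findall(s)
lazy_matches = lazy.findall(s)

print(len(greedy_matches))  # one long match spanning both elements
print(len(lazy_matches))    # one match per <organisatie> element
```

Note that re.S is still needed in both cases, because without it the
dot does not match the newlines inside each element.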
This is covered in the documentation for the re module.
http://docs.python.org/lib/module-re.html
Jason
On Sep 17, 9:00 am, duikboot <dijkstra.ar... at gmail.com> wrote:
> Hello,
>
> I am trying to extract a list of strings from a text. I have been
> looking at it for hours now, and googling didn't help either.
> Could you please help me?
>
> >>> import re
> >>> s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
> >>> regex = re.compile(r'<organisatie.*</organisatie>', re.S)
> >>> L = regex.findall(s)
> >>> print L
>
> ['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']
>
> I expected:
> ['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>',
>  '<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']
>
> I must be missing something very obvious.
>
> Greetings Arjen