simple regular expression problem

Mon Sep 17 10:53:17 EDT 2007

duikboot wrote:

> Hello,
> 
> I am trying to extract a list of strings from a text. I have been looking
> at it for hours now, and googling didn't help either.
> Could you please help me?
> 
> >>> import re
> >>> s = """
> ... \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
> >>> regex = re.compile(r'<organisatie.*</organisatie>', re.S)
> >>> L = regex.findall(s)
> >>> print(L)
> ['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>
> \n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']
> 
> I expected:
> [('organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>
> \n<organisatie>), (<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</
> organisatie')]
> 
> I must be missing something very obvious.

Don't use regular expressions to process XML. They are not the right tool
for the job, and even if simple cases like yours can often be made to work
initially, the longer you work with them, the more complex and troublesome
the code gets.
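
For what it's worth, the single big match in the question comes from `.*`
being greedy: it runs on to the last closing tag in the string. A minimal
sketch of the non-greedy variant follows, assuming flat, non-nested
<organisatie> elements as in the sample. The name `matches` is just for
illustration, and this only handles the simple case.

```python
import re

s = (
    "\n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>"
    "\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"
)

# `.*?` is non-greedy: it stops at the first closing tag it can reach
# instead of running on to the last one. re.S lets `.` match newlines.
regex = re.compile(r'<organisatie>.*?</organisatie>', re.S)
matches = regex.findall(s)

for m in matches:
    print(repr(m))
```

This yields one string per element, but it falls apart as soon as elements
nest or the markup varies, which is where the trouble mentioned above starts.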

Instead, use the right tool, for example lxml. It has, among other things,
XPath expressions built in, which do the job:

from lxml import etree

tree = etree.fromstring("""<root><organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie></root>""")

for feld in tree.xpath('//organisatie/Profiel_Id'):
    print(feld.text)
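
If lxml is not installed, roughly the same thing can be sketched with the
standard library's xml.etree.ElementTree, which supports a limited XPath
subset. This is only an illustration under the assumption of Python 3 and
the same wrapped <root> document as above; the names `xml_text`, `ids` and
`blocks` are made up for the example.

```python
import xml.etree.ElementTree as ET

xml_text = (
    "<root><organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n"
    "<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie></root>"
)
root = ET.fromstring(xml_text)

# Text of every Profiel_Id that is a child of an organisatie element.
ids = [el.text for el in root.findall('.//organisatie/Profiel_Id')]
print(ids)

# Each organisatie element serialized back to a string. tostring() also
# emits the whitespace following the element, so strip it off.
blocks = [ET.tostring(el, encoding='unicode').strip()
          for el in root.findall('organisatie')]
print(len(blocks))
```

The second list is closest to what the original question asked for: one
string per <organisatie> element.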

Diez