simple regular expression problem
Diez B. Roggisch
deets at nospam.web.de
Mon Sep 17 10:53:17 EDT 2007
duikboot wrote:
> Hello,
>
> I am trying to extract a list of strings from a text. I am looking it
> for hours now, googling didn't help either.
> Could you please help me?
>
>>>>s = """
>>>>\n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
>>>> regex = re.compile(r'<organisatie.*</organisatie>', re.S)
>>>> L = regex.findall(s)
>>>> print L
> ['organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>
> \n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie']
>
> I expected:
> [('organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>
> \n<organisatie>), (<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</
> organisatie')]
>
> I must be missing something very obvious.
Don't use regular expressions to process XML. It's not the right tool for
the job, and even if simple cases as yours often can made work initially,
the longer you work with it, the more complex and troublesome the code
gets.
Instead, use the right tool, for example lxml. That has e.g.
XPath-expressions build in, that do the job:
from lxml import etree
tree =
etree.fromstring("""<root><organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie></root>""")
for feld in tree.xpath('//organisatie/Profiel_Id'):
print feld.text
Diez
More information about the Python-list
mailing list