simple regular expression problem

George Sakkis george.sakkis at gmail.com
Mon Sep 17 09:48:41 EDT 2007


On Sep 17, 9:00 am, duikboot <dijkstra.ar... at gmail.com> wrote:

> Hello,
>
> I am trying to extract a list of strings from a text. I have been looking
> at it for hours now, and googling didn't help either.
> Could you please help me?
>
> >>> import re
> >>> s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
> >>> regex = re.compile(r'<organisatie.*</organisatie>', re.S)
> >>> L = regex.findall(s)
> >>> print L
>
> ['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']
>
> I expected:
> ['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>', '<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']
>
> I must be missing something very obvious.
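
The immediate cause of the single match is that ".*" is greedy: with
re.S it runs from the first "<organisatie" all the way to the last
"</organisatie>". A minimal sketch of the non-greedy variant, assuming
one string per element is what you were after (the names "greedy" and
"lazy" are mine, and the code is written for Python 3):

```python
import re

s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""

# Original pattern: ".*" grabs as much as possible, so the whole text
# between the first opening tag and the last closing tag is one match.
greedy = re.compile(r'<organisatie.*</organisatie>', re.S)

# ".*?" grabs as little as possible, so each match stops at the first
# closing tag it meets.
lazy = re.compile(r'<organisatie>.*?</organisatie>', re.S)

greedy_matches = greedy.findall(s)
lazy_matches = lazy.findall(s)

print(len(greedy_matches))  # 1
print(len(lazy_matches))    # 2
print(lazy_matches)
```

This only holds up while the elements are not nested and the tags carry
no attributes.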

The less obvious thing that you're missing is that regular expressions
are not the best solution to every text-related problem. Thinking at a
higher level sometimes helps: here you don't want to extract "a list of
strings from a text", you want to extract specific elements from an XML
data source. There are several standard and non-standard Python packages
for XML processing; look for them online. Here's how to do it using the
(third-party) BeautifulSoup module:

>>> from BeautifulSoup import BeautifulStoneSoup
>>> BeautifulStoneSoup(s).findAll('organisatie')
[<organisatie>
<profiel_id>28996</profiel_id>
</organisatie>, <organisatie>
<profiel_id>28997</profiel_id>
</organisatie>]
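
If you would rather stay with the standard library, something along
these lines should also work. It is only a sketch, written for Python 3:
the wrapping <root> element is made up (your snippet has several
top-level elements, so it is not a well-formed document by itself), and
the names "profile_ids" and "fragments" are mine.

```python
import xml.etree.ElementTree as ET

s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""

# Wrap the fragment in a hypothetical <root> element so that the parser
# sees exactly one top-level element.
root = ET.fromstring("<root>" + s + "</root>")

# Text of the <Profiel_Id> child of every <organisatie> element.
profile_ids = [org.findtext("Profiel_Id") for org in root.findall("organisatie")]
print(profile_ids)  # ['28996', '28997']

# Each element serialized back to a string; strip() drops the newline
# that follows the closing tag in the source text.
fragments = [ET.tostring(org, encoding="unicode").strip()
             for org in root.findall("organisatie")]
print(fragments)
```

Unlike the BeautifulStoneSoup output above, ElementTree keeps the
original case of the tag names.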


HTH,
George



