simple regular expression problem

Mon Sep 17 09:50:32 EDT 2007

duikboot wrote:
> Hello,
> 
> I am trying to extract a list of strings from a text. I have been
> looking at it for hours now, and googling didn't help either.
> Could you please help me?
> 
> >>> s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
> >>> regex = re.compile(r'<organisatie.*</organisatie>', re.S)
> >>> L = regex.findall(s)
> >>> print L
> ['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>
> \n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']
> 
> I expected:
> [('organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>
> \n<organisatie>), (<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</
> organisatie')]
> 
> I must be missing something very obvious.

Regarding the regexp, Jason gave you the answer. Another point: when
dealing with XML, it's sometimes better to use an XML parser.
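
For the record, here is a minimal sketch of the regexp side. It assumes the fix is the usual non-greedy quantifier (`.*?` instead of `.*`); I'm not quoting Jason's answer, and the names `greedy` and `lazy` are mine. It is written for Python 3, hence `print()` as a function.

```python
import re

s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""

# Greedy: .* runs on to the LAST </organisatie>, so everything ends up
# in one single match.
greedy = re.compile(r'<organisatie.*</organisatie>', re.S)

# Non-greedy: .*? stops at the FIRST </organisatie> after each opening
# tag, which gives one match per element.
lazy = re.compile(r'<organisatie.*?</organisatie>', re.S)

greedy_matches = greedy.findall(s)
lazy_matches = lazy.findall(s)

print(len(greedy_matches))  # 1
print(len(lazy_matches))    # 2
for match in lazy_matches:
    print(repr(match))
```

This holds up only as long as the elements are not nested and the markup stays this regular.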

Quick and dirty, with ElementTree:

 >>> from xml.etree import ElementTree as ET
 >>> s = "<root>" + s + "</root>"
 >>> tree = ET.fromstring(s)
 >>> tree
<Element root at b795b2ac>
 >>> tree.findall("organisatie/Profiel_Id")
[<Element Profiel_Id at b795b32c>, <Element Profiel_Id at b795b3ec>]
 >>> _[0].text
'28996'
 >>> [it.text for it in tree.findall("organisatie/Profiel_Id")]
['28996', '28997']
 >>>
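
The same session as a standalone script, in case that is easier to reuse. This is only a sketch for Python 3; the dummy `<root>` wrapper is needed because the fragment has two top-level elements, and the names `ids` and `blocks` are mine. The last step, turning each element back into a string with `ET.tostring`, is an addition that the session above does not show.

```python
from xml.etree import ElementTree as ET

s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""

# A well-formed XML document needs exactly one root element,
# so wrap the fragment in a throwaway one.
root = ET.fromstring("<root>" + s + "</root>")

# Text content of every Profiel_Id that sits directly under an organisatie.
ids = [el.text for el in root.findall("organisatie/Profiel_Id")]
print(ids)

# If the whole <organisatie> blocks are wanted as strings, serialize each
# element again (the result may carry trailing whitespace from the source).
blocks = [ET.tostring(el, encoding="unicode") for el in root.findall("organisatie")]
print(len(blocks))
```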

HTH