RegExp Help

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Fri Dec 14 06:06:00 EST 2007


On Fri, 14 Dec 2007 06:06:21 -0300, Sean DiZazzo <half.italian at gmail.com>  
wrote:

> On Dec 14, 12:04 am, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
>> On Thu, 13 Dec 2007 17:49:20 -0800, Sean DiZazzo wrote:
>> > I'm wrapping up a command line util that returns xml in Python.  The
>> > util is flaky, and gives me back poorly formed xml with different
>> > problems in different cases.  Anyway I'm making progress.  I'm not
>> > very good at regular expressions though and was wondering if someone
>> > could help with initially splitting the tags from the stdout returned
>> > from the util.
>>
>> Flaky XML is often produced by programs that treat XML as ordinary text
>> files. If you are starting to parse XML with regular expressions you are
>> making the very same mistake.  XML may look somewhat simple but
>> producing correct XML and parsing it isn't.  Sooner or later you stumble
>> across something that breaks producing or parsing the "naive" way.
>>
> It's not really complicated xml so far, just tags with attributes.
> Still, using different queries against the program sometimes offers
> differing results...a few examples:
>
> <id 123456 />
> <tag name="foo" />
> <tag2 name="foo" moreattrs="..." /tag2>
> <tag3 name="foo" moreattrs="..." tag3/>

Ouch... only the second is well-formed XML. Most tools require at least a  
well-formed document. You may try using BeautifulStoneSoup, included with  
BeautifulSoup: http://crummy.com/software/BeautifulSoup/

> I found something that works, although I couldn't tell you why it
> works.  :)
> retag = re.compile(r'<.+?>', re.DOTALL)
> tags = retag.findall(output)  # output: the text returned by the util
> Why does that work?

That means: "look for a less-than sign (<), followed by the shortest  
sequence of (?) one or more (+) arbitrary characters (.), followed by a  
greater-than sign (>)"
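
To see what the "?" changes, here is a minimal sketch comparing the lazy pattern with its greedy counterpart. The sample string is made up for illustration (it just mimics two of the tags quoted above), and the variable names are my own. Note that findall() is called on the text to search.

```python
import re

# Made-up sample text, mimicking two of the tags quoted above.
sample = '<id 123456 />\n<tag name="foo" />'

# Lazy: '+?' takes as few characters as possible, so each match
# stops at the first '>' it can reach.
lazy = re.compile(r'<.+?>', re.DOTALL)

# Greedy: '+' takes as many characters as possible, so the match
# runs on to the last '>' in the text (DOTALL lets '.' cross newlines).
greedy = re.compile(r'<.+>', re.DOTALL)

lazy_tags = lazy.findall(sample)
greedy_tags = greedy.findall(sample)

print(lazy_tags)    # two separate tags
print(greedy_tags)  # one match spanning the whole string
```

With the lazy version you get one list item per tag; with the greedy one, everything between the first "<" and the last ">" comes back as a single item.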

If you never get nested tags, and never have a ">" inside an attribute,  
that expression *might* work. But please try BeautifulStoneSoup; it uses a  
lot of heuristics to guess the right structure. It doesn't always work,  
but given your input, there isn't much else one can do...
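
A small sketch of the ">" caveat, using the same pattern. The first input is hypothetical (an attribute value containing ">"); the second is one of the malformed tags from the original message. The variable names are my own.

```python
import re

retag = re.compile(r'<.+?>', re.DOTALL)

# Hypothetical input: the attribute value itself contains a '>'.
tricky = '<tag name="a>b" />'
tricky_tags = retag.findall(tricky)

# The match stops at the '>' inside the quotes, so the tag is cut short
# and the remainder ('b" />') is silently dropped.
print(tricky_tags)

# One of the malformed tags from the original message: the pattern
# returns it whole, because it only looks for '<' ... '>' and knows
# nothing about XML structure.
malformed = '<tag2 name="foo" moreattrs="..." /tag2>'
malformed_tags = retag.findall(malformed)
print(malformed_tags)
```

So the expression only splits the text into "<...>" chunks; it neither validates them nor repairs them.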


-- 
Gabriel Genellina



