RegExp Help
Sean DiZazzo
half.italian at gmail.com
Fri Dec 14 12:45:05 EST 2007
On Dec 14, 3:06 am, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
wrote:
> On Fri, 14 Dec 2007 06:06:21 -0300, Sean DiZazzo <half.ital... at gmail.com>
> wrote:
>
>
>
> > On Dec 14, 12:04 am, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
> >> On Thu, 13 Dec 2007 17:49:20 -0800, Sean DiZazzo wrote:
> >> > I'm wrapping up a command line util that returns xml in Python. The
> >> > util is flaky, and gives me back poorly formed xml with different
> >> > problems in different cases. Anyway I'm making progress. I'm not
> >> > very good at regular expressions though and was wondering if someone
> >> > could help with initially splitting the tags from the stdout returned
> >> > from the util.
>
> >> Flaky XML is often produced by programs that treat XML as ordinary text
> >> files. If you are starting to parse XML with regular expressions you are
> >> making the very same mistake. XML may look somewhat simple but
> >> producing correct XML and parsing it isn't. Sooner or later you stumble
> >> across something that breaks producing or parsing the "naive" way.
>
> > It's not really complicated xml so far, just tags with attributes.
> > Still, using different queries against the program sometimes offers
> > differing results...a few examples:
>
> > <id 123456 />
> > <tag name="foo" />
> > <tag2 name="foo" moreattrs="..." /tag2>
> > <tag3 name="foo" moreattrs="..." tag3/>
>
> Ouch... only the second is valid xml. Most tools require at least a
> well-formed document. You may try using BeautifulStoneSoup, included with
> BeautifulSoup: http://crummy.com/software/BeautifulSoup/
>
> > I found something that works, although I couldn't tell you why it
> > works. :)
> > retag = re.compile(r'<.+?>', re.DOTALL)
> > tags = retag.findall(output)  # output (hypothetical name): the util's stdout text
> > Why does that work?
>
> That means: "look for a less-than sign (<), followed by the shortest
> sequence of (?) one or more (+) arbitrary characters (.), followed by a
> greater-than sign (>)"
>
> If you never get nested tags, and never have a ">" inside an attribute,
> that expression *might* work. But please try BeautifulStoneSoup; it uses a
> lot of heuristics to guess the right structure. It doesn't always work,
> but given your input, there isn't much one can do...
>
> --
> Gabriel Genellina
Thanks! I'll take a look at BeautifulStoneSoup today and see what I
get.
~Sean
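
For anyone following the thread, here is a minimal sketch of what the pattern discussed above does on the sample tags, together with the failure case Gabriel mentions (a ">" inside an attribute). The names `output`, `greedy`, `tricky`, and `broken` are made up for illustration, and the code assumes Python 3 rather than the Python of the original thread.

```python
import re

# Sample text copied from the four example tags quoted in the thread.
# "output" is a hypothetical stand-in for the util's stdout.
output = '''<id 123456 />
<tag name="foo" />
<tag2 name="foo" moreattrs="..." /tag2>
<tag3 name="foo" moreattrs="..." tag3/>'''

# "<", then the shortest run of one or more arbitrary characters, then ">".
# re.DOTALL lets "." match newlines too, so a tag may span several lines.
retag = re.compile(r'<.+?>', re.DOTALL)
tags = retag.findall(output)
for tag in tags:
    print(tag)

# Without the "?", ".+" is greedy and runs to the LAST ">" in the text,
# so everything is swallowed into a single match.
greedy = re.findall(r'<.+>', output, re.DOTALL)

# The caveat: a ">" inside an attribute value ends the match too early.
tricky = '<tag name="a>b" />'
broken = retag.findall(tricky)
print(broken)
```

On the sample text the non-greedy pattern yields one match per tag, while the attribute containing ">" is cut off at that character, which is why a forgiving parser is the safer choice once the input gets less regular.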