HTML parsing confusion

Alnilam alnilam at gmail.com
Tue Jan 22 11:01:44 EST 2008


On Jan 22, 8:44 am, Alnilam <alni... at gmail.com> wrote:
> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> > 200-modules PyXML package installed. And you don't want the 75Kb
> > BeautifulSoup?
>
> I wasn't aware that I had PyXML installed, and can't find a reference
> to having it installed in pydocs. ...

Ugh. Found it. Sorry about that, but I still don't understand why
there isn't a simple way to do this without using PyXML, BeautifulSoup,
or libxml2dom. What's the point in having sgmllib, htmllib,
HTMLParser, and formatter all built in if I have to use someone
else's modules to write the couple of lines of code that achieve the
simple thing I want? I get the feeling that this would be easier if I
just broke down and wrote a couple of regular expressions, but it
hardly seems a 'pythonic' way of going about things.
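For what it's worth, the built-in HTMLParser really can do this much without
PyXML or any third-party package. Here's a rough sketch (the class name and
the sample HTML string are made up for illustration; the try/except import
covers the module's rename in Python 3):

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # renamed in Python 3

class ParagraphGrabber(HTMLParser):
    """Collect the text content of every <p>...</p> in a page."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.paragraphs = []  # finished paragraph texts
        self.in_p = False     # currently inside a <p>?
        self.chunks = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self.in_p = False
            # collapse runs of whitespace, like the fix_format regex did
            text = ' '.join(''.join(self.chunks).split())
            self.paragraphs.append(text)

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)

source = "<html><body><p>Hello,\n   <b>world</b>!</p><p>Second one.</p></body></html>"
grabber = ParagraphGrabber()
grabber.feed(source)
print(grabber.paragraphs)  # -> ['Hello, world!', 'Second one.']
```

Unlike the regex approach, this survives nested tags inside the paragraph
(the `<b>` above) because the parser hands over only the text nodes.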

# get the source (assuming you don't have it locally and have an internet connection)
>>> import urllib
>>> page = urllib.urlopen("http://diveintopython.org/")
>>> source = page.read()
>>> page.close()

# set up some regex to find tags, strip them out, and correct some formatting oddities
>>> import re
>>> p = re.compile(r'(<p.*?>.*?</p>)',re.DOTALL)
>>> tag_strip = re.compile(r'>(.*?)<',re.DOTALL)
>>> fix_format = re.compile(r'\n +',re.MULTILINE)

# achieve clean results.
>>> paragraphs = re.findall(p,source)
>>> text_list = re.findall(tag_strip,paragraphs[5])
>>> text = "".join(text_list)
>>> clean_text = re.sub(fix_format," ",text)

This works, and is small and easily reproduced, but it seems like it
would break easily, and it makes no use of the *ML-specific parsers
that are already built in.
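Two quick illustrations of that fragility, using invented test strings rather
than anything from diveintopython.org: the pattern is case-sensitive, so
uppercase tags slip through entirely, and `<p.*?>` happily matches the start
of a `<pre>` tag, dragging unrelated content into the match.

```python
import re

# the same paragraph regex as above
p = re.compile(r'(<p.*?>.*?</p>)', re.DOTALL)

# 1. Case-sensitive: old-school uppercase markup is missed entirely.
print(p.findall('<P>Uppercase tags</P>'))  # -> []

# 2. '<p' also matches '<pre', so the match swallows the whole block
#    up to the first literal '</p>'.
print(p.findall('<pre>code</pre> and <p>text</p>'))
# -> ['<pre>code</pre> and <p>text</p>']
```

A real parser keys on tag names rather than raw character runs, which is
exactly what avoids both failure modes.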


