[omaha] Parsing bad html

Jeff Hinrichs - DM&T jeffh at dundeemt.com
Wed Dec 12 04:36:49 CET 2007


One of the reasons I like Python:

I had to reformat some html today to take it from a poorly
hand-coded page and get it into a wiki.

Here is an example nugget of the raw source html (this isn't the
worst -- the worst had mangled, mismatched tags):
<a href="http://www.newscientist.com/">NS+</a>
       -- <a href="http://www.adquest3d.com/">Classified Ads </a>-- <a
href="http://www.radio-locator.com/cgi-bin/home">Radio
        </a>-- <a href="http://www.bookbrowser.com/Resources/Index.html">Book
        Links</a></b></font></font><b><font
face="Arial,Helvetica,Monaco"><font size="1"><a
href="http://www.ceoexpress.com/"> </a>--
        <a href="http://www.obscurestore.com/">Obscure</a> -- <a
href="http://www.ebay.com/">eBAY</a>
       -- <a href="http://www.online-pr.com/">Online PR</a> -- <a
href="http://catalogs.google.com/">Catalogs</a>
       -- <a href="http://www.nytimes.com/books/first/first-nonfiction.html">FirstChaps</a>
       -- <a href="http://www.loc.gov/">LOC</a> -- <a
href="http://www.ac6v.com/swl1.htm#WEBRADIO">WebRadio</a>

I needed to get it into a form like "* [[ TITLE | URL ]]".
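
For example, the first link in the snippet above would come out as:

 * [[NS+|http://www.newscientist.com/]]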

Well, I've used BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/) before, but it's been
a while, so the exact way to do it was not in my L1 cache<g>.  I knew
I didn't have the module on this machine, so I had to get it installed
and start from there.  I searched the docs for "links" and found
http://www.crummy.com/software/BeautifulSoup/documentation.html#Improving%20Performance%20by%20Parsing%20Only%20Part%20of%20the%20Document
-- midway through that section is an example that is darn near exactly
what I wanted.

What follows is what I did next:

jlh at jlh-d520:~$ sudo easy_install beautifulsoup
[sudo] password for jlh:
Searching for beautifulsoup
Reading http://cheeseshop.python.org/pypi/beautifulsoup/
Couldn't find index page for 'beautifulsoup' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading http://cheeseshop.python.org/pypi/
Reading http://cheeseshop.python.org/pypi/BeautifulSoup/3.0.4
Reading http://www.crummy.com/software/BeautifulSoup/
Reading http://www.crummy.com/software/BeautifulSoup/download/
Best match: BeautifulSoup 3.0.4
Downloading http://www.crummy.com/software/BeautifulSoup/download/BeautifulSoup-3.0.4.tar.gz
Processing BeautifulSoup-3.0.4.tar.gz
Running BeautifulSoup-3.0.4/setup.py -q bdist_egg --dist-dir
/tmp/easy_install-Ihuiu5/BeautifulSoup-3.0.4/egg-dist-tmp-gKUTwa
zip_safe flag not set; analyzing archive contents...
Adding BeautifulSoup 3.0.4 to easy-install.pth file

Installed /usr/lib/python2.5/site-packages/BeautifulSoup-3.0.4-py2.5.egg
Processing dependencies for beautifulsoup
Finished processing dependencies for beautifulsoup
jlh at jlh-d520:~$ python
Python 2.5.1 (r251:54863, Oct  5 2007, 13:36:32)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = """
... <a href="http://www.google.com/">Google</a>
...        -- <a href="http://www.alltheweb.com/">FAST</a> --<a
href="http://www.profusion.com/"> Prof</a>
...        -- <a href="http://www.ftpsearchengines.com/">FTP</a> -- <a
href="http://dogpile.com/">Dogpile</a> --<a
href="http://www.beaucoup.com/">Beaucoup</a>
</b></font></font><b><font face="Arial,Helvetica,Monaco"><font
size="1">--
...         <a href="http://www.findarticles.com/PI/index.jhtml">Articles</a>
--<a href="http://www.archive.org/"> Archives</a>
...        -- <a href="http://www.allacademic.com/">Academic</a> -- <a
href="http://www.kartoo.com/">Kartoo</a>
...        -- <a href="http://clusty.com/">Clusty </a>-- <a
href="http://www.teoma.com/">Teoma
...         </a>-- <a href="http://beta.search.msn.com/">MSN</a> --<a
href="http://www.cranky.com"><font color="RED"> Cranky</font></a>
...        -- <a href="http://discussion.lycos.com/">Discussions</a> --</font>
... """
>>> from BeautifulSoup import BeautifulSoup, SoupStrainer
>>>
>>>
>>> links = SoupStrainer('a')
>>> thelinks = [tag for tag in BeautifulSoup(s, parseOnlyThese=links)]
>>> for el in thelinks:
...     print el
...
<a href="http://www.google.com/">Google</a>
<a href="http://www.alltheweb.com/">FAST</a>
<a href="http://www.profusion.com/"> Prof</a>
<a href="http://www.ftpsearchengines.com/">FTP</a>
<a href="http://dogpile.com/">Dogpile</a>
<a href="http://www.beaucoup.com/">Beaucoup</a>
<a href="http://www.findarticles.com/PI/index.jhtml">Articles</a>
<a href="http://www.archive.org/"> Archives</a>
<a href="http://www.allacademic.com/">Academic</a>
<a href="http://www.kartoo.com/">Kartoo</a>
<a href="http://clusty.com/">Clusty </a>
<a href="http://www.teoma.com/">Teoma
        </a>
<a href="http://beta.search.msn.com/">MSN</a>
<a href="http://www.cranky.com"><font color="RED"> Cranky</font></a>
<a href="http://discussion.lycos.com/">Discussions</a>
>>> dir(thelinks)
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
'__delslice__', '__doc__', '__eq__', '__ge__', '__getattribute__',
'__getitem__', '__getslice__', '__gt__', '__hash__', '__iadd__',
'__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__',
'__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__',
'__setslice__', '__str__', 'append', 'count', 'extend', 'index',
'insert', 'pop', 'remove', 'reverse', 'sort']
>>> dir(thelinks[0])
['XML_SPECIAL_CHARS_TO_ENTITIES', '__call__', '__contains__',
'__delitem__', '__doc__', '__eq__', '__getattr__', '__getitem__',
'__init__', '__iter__', '__len__', '__module__', '__ne__',
'__nonzero__', '__repr__', '__setitem__', '__str__', '__unicode__',
'_findAll', '_findOne', '_getAttrMap', '_lastRecursiveChild',
'append', 'attrs', 'childGenerator', 'containsSubstitutions',
'contents', 'extract', 'fetch', 'fetchNextSiblings', 'fetchParents',
'fetchPrevious', 'fetchPreviousSiblings', 'fetchText', 'find',
'findAll', 'findAllNext', 'findAllPrevious', 'findChild',
'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings',
'findParent', 'findParents', 'findPrevious', 'findPreviousSibling',
'findPreviousSiblings', 'first', 'firstText', 'get', 'has_key',
'hidden', 'insert', 'isSelfClosing', 'name', 'next', 'nextGenerator',
'nextSibling', 'nextSiblingGenerator', 'parent', 'parentGenerator',
'parserClass', 'prettify', 'previous', 'previousGenerator',
'previousSibling', 'previousSiblingGenerator',
'recursiveChildGenerator', 'renderContents', 'replaceWith', 'setup',
'string', 'substituteEncoding', 'toEncoding']
>>> thelinks[0].attrs
[(u'href', u'http://www.google.com/')]
>>> thelinks[0].attrs[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> thelinks[0].attrs[0][1]
u'http://www.google.com/'
>>> thelinks[0].fetchText
<bound method Tag.fetchText of <a href="http://www.google.com/">Google</a>>
>>> thelinks[0].fetch
<bound method Tag.findAll of <a href="http://www.google.com/">Google</a>>
>>> thelinks[0].name
u'a'
>>> thelinks[0].setup
<bound method Tag.setup of <a href="http://www.google.com/">Google</a>>
>>> thelinks[0].extract
<bound method Tag.extract of <a href="http://www.google.com/">Google</a>>
>>> thelinks[0].extract()
>>> thelinks[0].contents
[u'Google']
>>> thelinks[0].attrs[0][1]
u'http://www.google.com/'
>>>

dir(something) in the interactive interpreter lists all of the
properties and methods available for a given object.  By looking at
the output for thelinks, it was obviously a list, or some other
object implemented as a list.  So then I needed to figure out what
the list elements were, since trying to .strip() them was raising a
TypeError.  A quick dir(thelinks[0]) showed that the elements were
not simple strings or lists of strings but a more complicated
object.  Then I just needed to find what would return the URL and
title.  Not caring to read more documentation, you can see my
attempts above before I found the two necessary properties:
.contents and .attrs.  So I ended up with the following script to
hammer my way through a few hundred links:

from BeautifulSoup import BeautifulSoup, SoupStrainer

# s holds the raw html -- I pasted it in at the interpreter, but you
# could just as easily read it from a file, e.g. s = open('page.html').read()

links = SoupStrainer('a')
thelinks = [tag for tag in BeautifulSoup(s, parseOnlyThese=links)]
for el in thelinks:
    try:
        # .contents[0] is the link text, .attrs[0][1] is the href value
        print ' * [[%s|%s]]' % (el.contents[0].strip(), el.attrs[0][1])
    except TypeError:
        # .contents[0] wasn't a plain string (e.g. a nested <font> tag)
        print ' * [[%s|%s]]' % (el.contents[0], el.attrs[0][1])
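
Run against the sample string from the interpreter session above, the
output starts like this:

 * [[Google|http://www.google.com/]]
 * [[FAST|http://www.alltheweb.com/]]
 * [[Prof|http://www.profusion.com/]]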

You'll notice the try/except block.  I needed that for the cases
where .contents returned something more complicated than a string,
e.g. <font color="red">stuff</font>.  That is an html element object,
and it doesn't take kindly to having a string method like .strip()
performed on it.  For those, I just flattened the output and hand
edited it.  Total time for researching, experimenting and
implementing: about 20 minutes.  Compared with the time it was taking
to hand edit the source html snippets into wiki links, that was a
huge savings.
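
If I'd wanted to skip the hand editing, a small variant could flatten
those nested tags automatically.  This is just a sketch, not what I
ran; it assumes the same s as above and uses two BS3 features visible
in the dir() listing earlier: dictionary-style attribute access on a
Tag and findAll(text=True), which returns all the text nodes under a
tag:

from BeautifulSoup import BeautifulSoup, SoupStrainer

links = SoupStrainer('a')
for el in BeautifulSoup(s, parseOnlyThese=links):
    # join every text node under the <a>, so nested markup like
    # <font color="red"> Cranky</font> flattens to its text
    title = ''.join(el.findAll(text=True)).strip()
    print ' * [[%s|%s]]' % (title, el['href'])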

Not a fancy script by any stretch of the imagination -- but a decent
example of using interactive Python to your advantage and letting me
remain the lazy guy that I am.  The other thing to remember: when
forced to parse html of questionable quality, BeautifulSoup is your
friend.


-- 
Jeff Hinrichs
Dundee Media & Technology, Inc
jeffh at dundeemt.com
402.218.1473

