[omaha] Parsing bad html

Mike Hostetler mike at hostetlerhome.com
Wed Dec 12 14:57:55 CET 2007



Eli Criffield wrote:
> I love BeautifulSoup, it takes any random crap html you throw at it
> and turns in a beautiful pythonic object with methods for everything
> you would want to be able to do with it. I use it to even "prettify"
> (yep thats a method for a Soup object too) some of my html before i
> publish.

I was going to say the same thing.  If it looks something like HTML,
then BeautifulSoup can parse it. It's really a killer library for
Python (although now there is a Ruby version of it).
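
To make that concrete, here is a minimal sketch of feeding BeautifulSoup
some markup with unclosed tags. It assumes the current bs4 package on
Python 3 (not the BeautifulSoup 3 module that was current when this was
written), and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sloppy markup: no <html> or <body>, and nothing is ever closed.
bad_html = '<h1>Report<p>Unclosed <b>bold <a href="/logs">logs'

# "html.parser" is the parser that ships with Python itself.
soup = BeautifulSoup(bad_html, "html.parser")

# The broken input still becomes a navigable tree.
link = soup.find("a")
print(link["href"])

# prettify() returns the tree re-serialized with the missing
# closing tags filled in, one tag per line.
pretty = soup.prettify()
print(pretty)
```

Exactly how the unclosed tags end up nested depends on which parser you
pick, so treat the tree shape as parser-specific.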

BeautifulSoup also has one of my favorite-named classes of all time:

 class UnicodeDammit
     |  A class for detecting the encoding of a *ML document and
     |  converting it to a Unicode string. If the source encoding is
     |  windows-1252, can replace MS smart quotes with their HTML or XML
     |  equivalents.
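
A small sketch of what that docstring describes, again assuming the
current bs4 package, where UnicodeDammit can be imported directly. The
sample bytes are made up; \x93 and \x94 are the windows-1252 curly
double quotes.

```python
from bs4 import UnicodeDammit

# Made-up bytes as a Windows tool might write them, with
# windows-1252 "smart" double quotes around one word.
raw = b"<p>I just \x93love\x94 Microsoft Word</p>"

# The second argument lists encodings to try first;
# smart_quotes_to="html" asks for HTML entities in place of the quotes.
dammit = UnicodeDammit(raw, ["windows-1252"], smart_quotes_to="html")

print(dammit.unicode_markup)
print(dammit.original_encoding)
```

Passing smart_quotes_to="xml" instead should give numeric XML character
references rather than named HTML entities.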

I have a script that combines BeautifulSoup and ClientCookie to fetch
and display application log files, which is much better than my
employer's clicky-clicky-clicky web application for it.

http://wwwsearch.sourceforge.net/ClientCookie/
(now apparently part of: http://wwwsearch.sourceforge.net/mechanize/ )
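
Roughly the shape such a script could take, as a sketch only and not my
actual script. It swaps ClientCookie for the standard library's
http.cookiejar and urllib, assumes the bs4 package, and assumes the log
text sits inside <pre> blocks; the function names and the sample page
are hypothetical.

```python
import http.cookiejar
import urllib.request

from bs4 import BeautifulSoup


def make_opener():
    """Build a URL opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def fetch_html(opener, url):
    """Fetch a page through the cookie-aware opener and return the raw bytes."""
    with opener.open(url) as response:
        return response.read()


def extract_log_lines(html):
    """Return the text of every <pre> block, split into lines (assumed page layout)."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for pre in soup.find_all("pre"):
        lines.extend(pre.get_text().splitlines())
    return lines


# Hypothetical page body standing in for a real fetch_html() result.
sample = b"<html><body><pre>INFO started\nERROR disk full</pre>"
log_lines = extract_log_lines(sample)
print(log_lines)
```

In real use you would call fetch_html(make_opener(), url) once for the
login page and again for the log page, so the session cookie carries
over between the two requests.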

Mike Hostetler
http://mike.hostetlerhome.com


