[omaha] Parsing bad html
Mike Hostetler
mike at hostetlerhome.com
Wed Dec 12 14:57:55 CET 2007
Eli Criffield wrote:
> I love BeautifulSoup, it takes any random crap HTML you throw at it
> and turns it into a beautiful pythonic object with methods for everything
> you would want to be able to do with it. I even use it to "prettify"
> (yep, that's a method for a Soup object too) some of my HTML before I
> publish.
I was going to say the same thing. If it looks something like HTML,
then BeautifulSoup can parse it. It's really a killer library for
Python (although now there is a Ruby version of it).
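For anyone who hasn't tried it, here is a minimal sketch of what that looks like. It assumes the current bs4 package and its "html.parser" backend (back in the BeautifulSoup 3 days the import was "from BeautifulSoup import BeautifulSoup"). The markup and the /log/ hrefs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: unclosed <p>, <b>, <body>, <html>,
# and an unquoted attribute value. (Made-up sample data.)
bad_html = (
    "<html><body><p>First paragraph<p>Second <b>bold"
    "<a href='/log/1'>one</a><a href=/log/2>two</a>"
)

# "html.parser" is the stdlib-backed builder, so no extra parser is needed.
soup = BeautifulSoup(bad_html, "html.parser")

# Pull out every link target and link text, in document order.
links = [a.get("href") for a in soup.find_all("a")]
texts = [a.get_text() for a in soup.find_all("a")]

# prettify() returns an indented string with the open tags closed.
pretty = soup.prettify()
print(pretty)
```

The exact tree you get for broken markup depends on which parser backend is used, so treat the prettified output as one plausible repair rather than the only one.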
BeautifulSoup also has one of my favorite-named classes of all time:
class UnicodeDammit
 |  A class for detecting the encoding of a *ML document and
 |  converting it to a Unicode string. If the source encoding is
 |  windows-1252, can replace MS smart quotes with their HTML or XML
 |  equivalents.
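To make the docstring concrete, here is a rough stdlib-only sketch of the same idea: try a couple of encodings, then swap Windows smart quotes for HTML entities. This is not how UnicodeDammit is implemented (its detection is much more thorough); decode_with_fallback and smart_quotes_to_html are hypothetical helper names invented for this example.

```python
# Hypothetical helpers, NOT part of BeautifulSoup: a simplified
# approximation of what the UnicodeDammit docstring describes.

# Unicode smart quotes mapped to their HTML entity equivalents.
SMART_QUOTES = {
    "\u2018": "&lsquo;",
    "\u2019": "&rsquo;",
    "\u201c": "&ldquo;",
    "\u201d": "&rdquo;",
}


def decode_with_fallback(raw, encodings=("utf-8", "windows-1252")):
    """Try each encoding in turn; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every possible byte, so this last resort cannot fail.
    return raw.decode("latin-1"), "latin-1"


def smart_quotes_to_html(text):
    """Replace curly quotes with HTML entities."""
    for char, entity in SMART_QUOTES.items():
        text = text.replace(char, entity)
    return text


# Bytes as a Windows editor might save them: 0x93/0x94 are curly
# double quotes and 0x92 is a curly apostrophe in windows-1252.
raw = b"\x93Hello\x94 it\x92s"
text, encoding = decode_with_fallback(raw)
html = smart_quotes_to_html(text)
print(encoding, html)
```

In the current bs4 package the real class is importable as "from bs4 import UnicodeDammit" and accepts a smart_quotes_to="html" argument, which is the thing to reach for in practice.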
I have a script that combines BeautifulSoup and ClientCookie to fetch
and display application log files, which is much better than my
employer's clicky-clicky-clicky web application for it.

http://wwwsearch.sourceforge.net/ClientCookie/
(now apparently part of: http://wwwsearch.sourceforge.net/mechanize/ )
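The cookie-handling half of such a script might look roughly like this. It is a sketch, not my actual script: it assumes Python 3's stdlib http.cookiejar and urllib.request standing in for ClientCookie, and make_cookie_opener and fetch_text are hypothetical names. Nothing is fetched until you call fetch_text with a real URL.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Build a URL opener that stores and resends cookies across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch_text(opener, url, encoding="utf-8"):
    """Fetch a page through the cookie-aware opener and decode it.

    The default encoding is an assumption; undecodable bytes are
    replaced rather than raising.
    """
    with opener.open(url, timeout=30) as response:
        return response.read().decode(encoding, errors="replace")


# One opener is reused for every request, so a session cookie set by
# a login page is sent along with the later log-file requests.
opener, jar = make_cookie_opener()
```

The text returned by fetch_text can then be handed straight to BeautifulSoup for parsing.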
Mike Hostetler
http://mike.hostetlerhome.com