[omaha] Parsing bad html

Matthew Nuzum newz at bearfruit.org
Wed Dec 12 16:38:43 CET 2007


On Dec 12, 2007 7:57 AM, Mike Hostetler <mike at hostetlerhome.com> wrote:

> I was going to say the same thing. If it looks something like HTML, then
> BeautifulSoup can parse it. It's really a killer library for Python
> (although now there is a Ruby version of it).
>
> BeautifulSoup also has one of my favorite-named classes of all time:
>
> class UnicodeDammit
>  |  A class for detecting the encoding of a *ML document and
>  |  converting it to a Unicode string. If the source encoding is
>  |  windows-1252, can replace MS smart quotes with their HTML or XML
>  |  equivalents.
>
>
Oh, that is so beautiful. Nice tip, cp1252 is a curse.
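
For anyone who has not been bitten by cp1252 yet, here is a minimal sketch of
the smart-quote replacement that docstring describes. It uses only the Python
standard library rather than BeautifulSoup itself; the helper name
cp1252_to_html and the mapping table are made up for illustration and are not
the library's API or its actual implementation.

```python
# Hypothetical helper, NOT part of BeautifulSoup: decode windows-1252
# bytes and swap the MS "smart" quotes for HTML entities.
SMART_TO_HTML = {
    "\u2018": "&lsquo;",  # cp1252 byte 0x91, left single quote
    "\u2019": "&rsquo;",  # cp1252 byte 0x92, right single quote
    "\u201c": "&ldquo;",  # cp1252 byte 0x93, left double quote
    "\u201d": "&rdquo;",  # cp1252 byte 0x94, right double quote
}


def cp1252_to_html(raw: bytes) -> str:
    """Decode cp1252 bytes, then replace smart quotes with entities."""
    text = raw.decode("cp1252")
    for char, entity in SMART_TO_HTML.items():
        text = text.replace(char, entity)
    return text


sample = b"\x93It\x92s a curse\x94"
print(cp1252_to_html(sample))  # prints &ldquo;It&rsquo;s a curse&rdquo;
```

The same bytes decoded as latin-1 would come out as invisible control
characters instead of quotes, which is roughly why the encoding has to be
detected before anything else can be done with the text.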

-- 
Matthew Nuzum
newz2000 on freenode
