character encoding conversion

Dylan dylans at yahoo.com
Sat Dec 11 20:28:29 EST 2004


Here's what I'm trying to do:

- scrape some html content from various sources

The issue I'm running into:

- some of the sources have incorrectly encoded characters... for
example, cp1252 curly quotes that were likely the result of the author
copying and pasting content from Word

I've searched and read for many hours, but have not found a solution
for handling the case where the page author does not use the character
encoding that they have specified.
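
One possible approach to a page whose bytes do not match its declared
encoding is to try the declared encoding first and fall back to cp1252
when decoding fails. The following is only a sketch under stated
assumptions: it is written for a current Python 3 rather than 2.3, and
the function name decode_html and its parameters declared and fallback
are hypothetical names chosen for illustration.

```python
def decode_html(raw, declared="utf-8", fallback="cp1252"):
    """Decode raw page bytes, falling back when the declared charset is wrong.

    `raw` is the undecoded byte string read from the server.
    `declared` is whatever charset the page or HTTP header claims (assumed here).
    """
    try:
        # Strict decoding raises if the bytes are not valid in the declared charset.
        return raw.decode(declared)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers a declared charset name Python does not know.
        # cp1252 leaves a few bytes undefined, so replace those instead of failing.
        return raw.decode(fallback, errors="replace")


page = b"\x93Hello\x94"  # cp1252 curly quotes in a page that claims UTF-8
text = decode_html(page, declared="utf-8")
```

This only helps when the declared encoding is strict enough to reject
the stray bytes (UTF-8 is; ISO-8859-1 accepts every byte and never
raises).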

Things I have tried include encode()/decode() and replacement lookup
tables (e.g. something like
http://groups-beta.google.com/group/comp.lang.python/browse_thread/thread/116158ad706dc7c1/11991de6ced3406b?q=python+html+parser+cp1252&_done=%2Fgroups%3Fq%3Dpython+html+parser+cp1252%26qt_s%3DSearch+Groups%26&_doneTitle=Back+to+Search&&d#11991de6ced3406b
).  However, I am still unable to convert the characters to something
meaningful.  In the case of the lookup table, this failed because all of
the improperly encoded characters came back as ? rather than as
their original byte values.
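
For reference, the lookup-table idea can be sketched as below. It
assumes the text was decoded as ISO-8859-1, so the cp1252 punctuation
survives as code points U+0080 to U+009F; if an earlier step has
already substituted ? for those characters, the original values are
gone and no table can recover them. The sketch targets a current
Python 3, and the name fix_c1_controls is hypothetical.

```python
def fix_c1_controls(text):
    """Remap C1 control characters to the cp1252 characters they likely meant."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                # Reinterpret the single byte value as cp1252.
                ch = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # A few values (e.g. 0x81) are undefined in cp1252; keep them as-is.
                pass
        out.append(ch)
    return "".join(out)


# Bytes 0x93/0x94 decoded as ISO-8859-1 end up as U+0093/U+0094.
broken = b"\x93quoted\x94 caf\xe9".decode("iso-8859-1")
fixed = fix_c1_controls(broken)
```

Characters outside the U+0080 to U+009F range, such as the é above, are
left untouched.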

I'm using urllib and htmllib to open, read, and parse the html
fragments, with Python 2.3 on OS X 10.3.

Any ideas or pointers would be greatly appreciated.

-Dylan Schiemann
http://www.dylanschiemann.com/





