Python3: Sane way to deal with broken encodings

Sun Dec 6 15:26:30 EST 2009

Dear all,

I have some applications which fetch HTML documents off the web, parse
their content and do stuff with it. Every once in a while, the web site
administrators put up files which are not encoded the way they claim
to be.

Thus my Python script dies a horrible death:

  File "./update_db", line 67, in <module>
    for line in open(tempfile, "r"):
  File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3286: unexpected code byte

Raising an error is usually the right thing to do, but I'd like to be
able to tell Python: "Don't worry, some idiot encoded that file, just
skip over such parts or replace them with some character sequence".

Is that possible? If so, how?
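
To make the question concrete, here is a minimal sketch of the behaviour I am after. It assumes the `errors` argument of the built-in `open()` is the right knob; the sample bytes and the temporary file are made up for illustration. Byte 0x92 is a curly apostrophe in Windows-1252, so the last line also assumes that this is what the file's author really used.

```python
import os
import tempfile

# Hypothetical sample data: 0x92 on its own is not valid UTF-8.
raw = b"It\x92s broken\nsecond line\n"

# Write the bytes to a throwaway file so the text-mode reads below
# resemble the failing loop from the traceback.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# errors="replace": each undecodable byte becomes U+FFFD.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    replaced = [line.rstrip("\n") for line in f]

# errors="ignore": undecodable bytes are silently dropped.
with open(path, "r", encoding="utf-8", errors="ignore") as f:
    ignored = [line.rstrip("\n") for line in f]

os.remove(path)

# Assumption: the file was really Windows-1252, where 0x92 is U+2019.
guessed = raw.decode("cp1252")

print(replaced)
print(ignored)
print(guessed)
```

Replacing keeps a visible marker where data was lost, while ignoring hides the damage entirely, so the former is probably the safer default for this kind of scraping.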

Kind regards,
Johannes

-- 
"Strong potentials can result in strong earthquakes; but small ones
can arise as well - and "you" will not believe it possible (!), but
behold: no earthquakes at all may result from them either."
-- "Rüdiger Thomas" alias Thomas Schulz in dsa on his "predictions"
<1a30da36-68a2-4977-9eed-154265b17d28 at q14g2000vbi.googlegroups.com>