[BangPypers] UnicodeDecodeError: 'utf8' codec can't decode byte xxx

Nikunj Badjatya nikunjbadjatya at gmail.com
Sun Apr 17 16:31:47 CEST 2011


Hi All,

I am working on a personal project that fetches certain URLs from the web,
does some processing, and stores the final contents in a text/PDF file.

I am also using html2text (
https://github.com/aaronsw/html2text/archives/master ) to convert the
fetched page into text format.
As a first step, I tried fetching a page and converting it to text using the
following code.

Code:
{{{
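#!/usr/bin/env python
# Python 2 script: fetch a page, save it, and convert it with html2text.

import os
import urllib

# Fetch the page ("some-web-link.htm" is a placeholder URL).
fetch = urllib.urlopen("some-web-link.htm")

# Save the fetched content to disk.
mainfile = open('main.html', 'w')
mainfile.write(fetch.read())
# Close the file so the data is flushed before html2text reads it.
mainfile.close()

# Convert the saved HTML to text.
os.system('python2.6 html2text.py main.html > main.txt')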

}}}

It fails with the following error:
{{{
Traceback (most recent call last):
  File "html2text.py", line 447, in <module>
    data = open(arg, 'r').read().decode(encoding)
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x88 in position 11366: invalid start byte

}}}
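
The traceback shows the failure happening when html2text reads main.html and decodes it as UTF-8, not when the page is written. A minimal sketch that reproduces the same kind of error in isolation is given here. The byte string is a made-up sample, not the actual page content, and the code is written to run on Python 3 as well as Python 2.6+.

```python
# 0x88 is a UTF-8 continuation byte (binary 10001000), so it can never
# begin a character. The sample bytes below are hypothetical.
raw = b"caf\x88"

try:
    raw.decode("utf-8")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    bad_position = exc.start  # index of the first offending byte

# Lenient error handlers avoid the exception but lose information.
replaced = raw.decode("utf-8", "replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", "ignore")    # bad byte is dropped
```

A byte like this usually suggests that the page is in some encoding other than UTF-8, although that cannot be confirmed from the traceback alone.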

I also tried the following change:
{{{
+import codecs

...
...
-mainfile = open('main.html', 'w')
+mainfile = codecs.open('xyz.htm', 'w', None, 'ignore')

...
...
}}}
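
One likely reason this attempt changes nothing: when the encoding argument of `codecs.open` is `None`, it returns an ordinary file object, so the `'ignore'` error handler is never used. The bytes on disk stay the same and html2text still fails when it decodes them. A possible workaround, sketched below, is to re-encode the fetched bytes as UTF-8 before saving them. The helper name `to_utf8`, its parameters, and the `cp1252` fallback are all assumptions made for illustration, not part of html2text or urllib.

```python
def to_utf8(raw, declared_charset=None, fallback="cp1252"):
    """Return raw bytes re-encoded as UTF-8 (hypothetical helper).

    Tries the charset declared by the server (if any), then UTF-8,
    and finally a lenient fallback that never raises.
    """
    for name in (declared_charset, "utf-8"):
        if not name:
            continue
        try:
            return raw.decode(name).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # Last resort: assume the fallback encoding and replace any
    # bytes it cannot map with U+FFFD.
    return raw.decode(fallback, "replace").encode("utf-8")


# Made-up sample inputs, standing in for fetch.read().
from_latin1 = to_utf8(b"caf\xe9", declared_charset="latin-1")
from_unknown = to_utf8(b"caf\x88")
```

In the script above, this would mean writing `to_utf8(fetch.read())` to main.html in binary mode (`'wb'`) and closing the file before running html2text. The fallback encoding is a guess, so characters may still come out wrong if the page uses a different encoding.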

The result is the same.

Please tell me what can be done to avoid this error.


Thanks,

Nikunj
Bangalore, India

