[BangPypers] UnicodeDecodeError: 'utf8' codec can't decode byte xxx
Nikunj Badjatya
nikunjbadjatya at gmail.com
Sun Apr 17 16:31:47 CEST 2011
Hi All,
I am working on a personal project that grabs certain URLs from the web, does
some processing, and stores the final contents in a text/PDF file.
I am also using html2text (
https://github.com/aaronsw/html2text/archives/master ) for converting the
fetched page into text format.
As a first step, I tried fetching a page and converting it to text using the
following code.
Code:
{{{
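#!/usr/bin/env python
import os
import urllib

# Fetch the page (Python 2 API; "some-web-link.htm" is a placeholder)
fetch = urllib.urlopen("some-web-link.htm")
# Write the raw bytes and close the file so the data is flushed to disk
# before html2text reads it
mainfile = open('main.html', 'wb')
mainfile.write(fetch.read())
mainfile.close()
# Convert the saved HTML to text
os.system('python2.6 html2text.py main.html > main.txt')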
}}}
It fails with the following error:
{{{
Traceback (most recent call last):
  File "html2text.py", line 447, in <module>
    data = open(arg, 'r').read().decode(encoding)
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x88 in position 11366: invalid start byte
}}}
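For reference, the same error can be reproduced in isolation. This is only a minimal sketch: the sample bytes below are made up and are not taken from the real page. Byte 0x88 is a UTF-8 continuation byte, so it cannot begin a character, which is what "invalid start byte" refers to.

```python
# Hypothetical sample bytes (not the real page): 0x88 sits at offset 3
data = b"abc\x88def"

try:
    data.decode("utf-8")          # strict decoding, as html2text.py does
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    bad_offset = exc.start        # index of the byte that could not be decoded

# Lenient decoding substitutes U+FFFD for the bad byte instead of raising
lenient = data.decode("utf-8", "replace")
```

Running the strict decode on the saved main.html in the same way would show whether the page itself is not UTF-8, independent of html2text.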
I also tried the following change:
{{{
+import codecs
...
...
- mainfile = open ('main.html', 'w' )
+mainfile = codecs.open('xyz.htm', 'w', None, 'ignore')
...
...
}}}
The result is the same.
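One likely reason: the change above only affects how the HTML file is written, while the traceback points at the decode step inside html2text.py. Also, with the encoding argument set to None, codecs.open appears to return an ordinary file object, so 'ignore' would have no effect. A possible direction, sketched under assumptions: decode the fetched bytes in the script and re-encode them as UTF-8 before writing. The helper name to_utf8 and the cp1252 fallback are hypothetical; the real page encoding is not known.

```python
def to_utf8(raw, fallback="cp1252"):
    """Return raw re-encoded as UTF-8 bytes (hypothetical helper)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: decode with the assumed fallback encoding,
        # replacing any byte that encoding cannot map
        text = raw.decode(fallback, "replace")
    return text.encode("utf-8")

# Made-up sample: 0xe9 and 0x88 are not valid UTF-8 in this position
sample = b"caf\xe9 \x88"
cleaned = to_utf8(sample)
```

In the script, this would mean writing to_utf8(fetch.read()) to a file opened in binary mode, so that the UTF-8 decode in html2text.py receives valid input.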
Please let me know what can be done to avoid this error.
Thanks,
Nikunj
Bangalore, India