[BangPypers] UnicodeDecodeError: 'utf8' codec can't decode byte xxx

Nikunj Badjatya nikunjbadjatya at gmail.com
Sun Apr 17 17:13:37 CEST 2011


Thanks for the quick reply.
I have never touched Django before.

I tried as:
{{{

#!/bin/python

import os
import urllib
+ from django.utils.encoding import smart_str

fetch = urllib.urlopen("some-web-link.htm")

mainfile = open ('main.html', 'w' )

+ myunistr = smart_str(fetch)

print myunistr

mainfile.write(myunistr)

os.system('python2.6 html2text.py main.html > main.txt')

}}}

The script ran without any errors, but when I open "main.html", which I
expected to hold the full contents of the page, it contains only:
{{{
<addinfourl at 148983116 whose fp = <socket._fileobject object at 0x8deabac>>
}}}
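
The text above is the string form of the object returned by urlopen(), not the page itself: smart_str was given `fetch` rather than `fetch.read()`. Below is a minimal sketch of the difference. It is written for Python 3 (the thread uses Python 2.6) and uses io.BytesIO as a stand-in for the response so that it runs without a network connection; the sample HTML and the temporary output path are assumptions made for illustration.

```python
import io
import os
import tempfile

# Stand-in for the object returned by urlopen(): it is file-like and only
# hands over the page body when read() is called. (Illustrative data.)
fetch = io.BytesIO(b"<html><body>hello</body></html>")

# Converting the object itself yields a description of the object, the same
# kind of text as the "<addinfourl at ...>" line, not the page contents.
as_object = str(fetch)

# Calling read() yields the actual bytes of the page.
body = fetch.read()

# Write the body in binary mode so the bytes reach the file unchanged.
out_path = os.path.join(tempfile.mkdtemp(), "main.html")
with open(out_path, "wb") as mainfile:
    mainfile.write(body)
```

In the script above, the equivalent change would be to pass fetch.read() wherever fetch is currently passed.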

Please let me know if I am missing something.

Thanks,
Nikunj



On Sun, Apr 17, 2011 at 8:11 PM, JAGANADH G <jaganadhg at gmail.com> wrote:

> On Sun, Apr 17, 2011 at 8:01 PM, Nikunj Badjatya
> <nikunjbadjatya at gmail.com> wrote:
>
> > Hi All,
> >
> > I am working on a personal project to fetch certain URLs from the web,
> > do some processing, and store the final contents in a text/PDF file.
> >
> > I am also using html2text (
> > https://github.com/aaronsw/html2text/archives/master ) to convert the
> > fetched page into text format.
> > As a first step, I tried fetching a page and converting it to text with
> > the following code.
> >
> > Code :
> > {{{
> > #!/bin/python
> >
> > import os
> > import urllib
> >
> > fetch = urllib.urlopen("some-web-link.htm")
> >
> > mainfile = open ('main.html', 'w' )
> >
> > mainfile.write(fetch.read())
> >
> > os.system('python2.6 html2text.py main.html > main.txt')
> >
> > }}}
> >
> > It flags an error:
> > {{{
> > Traceback (most recent call last):
> >  File "html2text.py", line 447, in <module>
> >    data = open(arg, 'r').read().decode(encoding)
> >  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
> >    return codecs.utf_8_decode(input, errors, True)
> > UnicodeDecodeError: 'utf8' codec can't decode byte 0x88 in position 11366: invalid start byte
> >
> > }}}
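
The traceback shows that the failure happens inside html2text.py, when it reads main.html and decodes it as UTF-8. The byte 0x88 can never begin a UTF-8 sequence, which suggests the page was served in some other encoding. Here is a small sketch of how such a failure can be inspected, in Python 3 with a made-up byte string standing in for the page; the helper name `describe_decode_failure` is hypothetical.

```python
def describe_decode_failure(raw, encoding="utf-8", context=10):
    """Return details about the first undecodable byte, or None if raw decodes."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        start = max(exc.start - context, 0)
        return {
            "position": exc.start,       # offset of the offending byte
            "byte": raw[exc.start],      # its value as an integer
            "reason": exc.reason,        # the codec's explanation
            "nearby": raw[start:exc.end + context],  # surrounding bytes
        }
    return None


# Illustrative input: ASCII text with one stray 0x88 byte in it.
sample = b"plain ascii " + b"\x88" + b" tail"
info = describe_decode_failure(sample)
```

Looking at the bytes around the reported position often makes it clearer which encoding the page really uses.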
> >
> > I also tried with
> > {{{
> > + import codecs
> >
> > ...
> > ...
> > - mainfile = open ('main.html', 'w' )
> > + mainfile = codecs.open('xyz.htm', 'w', None, 'ignore')
> >
> > ...
> > ...
> > }}}
> >
> > The result is the same.
> >
> > Please tell me what can be done to avoid this error.
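
One thing to check in the codecs.open attempt: as far as I know, when the encoding argument is None the call returns an ordinary file object and the errors argument ('ignore') has no effect, so the bytes are written unchanged. Since the failing decode happens later, in html2text, one option is to convert the page to UTF-8 before saving it. A rough sketch in Python 3 follows; the helper name `to_utf8`, the sample bytes, and the guess that the source is cp1252 are all assumptions, and the real encoding should be taken from the page's Content-Type header or meta tag.

```python
def to_utf8(raw, source_encoding):
    """Re-encode bytes from source_encoding into UTF-8."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


# Illustrative input: 0x88 is a valid character in cp1252 but not valid UTF-8.
raw_page = b"na\x88ve"
utf8_page = to_utf8(raw_page, "cp1252")

# A strict UTF-8 decode, like the one in the traceback, now succeeds.
roundtrip = utf8_page.decode("utf-8")
```

Writing `utf8_page` to main.html in binary mode would give html2text a file it can decode as UTF-8, provided the source encoding was identified correctly.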
> >
> >
>
>
> Try this
>
> from django.utils.encoding import smart_str
>
> myunistr = smart_str(YOUR_STRING)
>
> This will solve the issue
>
>
>
> --
> **********************************
> JAGANADH G
> http://jaganadhg.freeflux.net/blog
> *ILUGCBE*
> http://ilugcbe.techstud.org