[Tutor] trying to convert pycurl/html to ascii

Dave Angel davea at davea.name
Mon Mar 30 04:08:18 CEST 2015


On 03/29/2015 09:49 PM, bruce wrote:
> Hi.
>
> Doing a quick/basic pycurl test on a site and trying to convert the
> returned page to pure ascii.

You cannot convert it to pure ASCII.  You could replace all the invalid 
characters with some special one, like question marks.  But I doubt if 
that's what you really want.
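
For what it's worth, here is a minimal sketch of that question-mark replacement.  The sample text is made up for illustration, and it assumes you already have a unicode object rather than raw bytes:

```python
# Hypothetical sample: a unicode string containing two non-ASCII
# characters, e-acute (U+00E9) and no-break space (U+00A0).
text = u"caf\xe9\xa0au lait"

# The "replace" error handler substitutes "?" for every character
# that has no ASCII equivalent, so information is lost.
ascii_bytes = text.encode("ascii", "replace")

print(ascii_bytes)
```

Each of the two non-ASCII characters comes out as a single question mark, which is why this is rarely what you really want.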

>
> The page has the encoding line
>
> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

That would mean you should use ISO-8859-1 (also known as latin-1) in your decode.
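
As a rough sketch, assuming the bytes really are in the charset the page declares, the decode might look like this.  The sample bytes are invented; they are not from your page:

```python
# Hypothetical sample: a byte string holding byte 0xa0, which is
# a no-break space in ISO-8859-1.
raw = b"price:\xa0100"

# Decode using the charset named in the page's meta tag.
text = raw.decode("iso-8859-1")

# "latin-1" is another name for the same codec in Python.
same_text = raw.decode("latin-1")

print(repr(text))
```

Every byte value is valid in ISO-8859-1, so this decode cannot raise a UnicodeDecodeError, although it will give wrong characters if the page is not really in that encoding.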

>
> The test uses pycurl, and the StringIO to fetch the page into a str.
>
> pycurl stuff
> .
> .
> .
> foo=gg.getBuffer()
>
> -at this point, foo has the page in a str buffer.
>
>
> What's happening, is that the test is getting the following kind of error/
>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 20:
> invalid start byte

That's not the whole error.  You need to show the whole stack trace, not 
just a single line.  It would also be really useful if you showed the 
lines between the  foo= line and the one that gets the error.


>
> The test is using python 2.6 on redhat.
>
Very good to tell us that.  It makes a huge difference.

> I've tried different decode functions based on different
> sites/articles/stackoverflow but can't quite seem to resolve the issue.
>

Pick one, show us the code, and show us the full error traceback, and 
somebody can help.  As it stands, all I can tell you is that decode takes a 
byte string and an encoding name, and produces a unicode object.  And 
it's not going to give you a utf-8 error if you're trying to decode 8859.
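
To illustrate that last point, here is a small sketch with made-up data that puts byte 0xa0 at position 20, as in the quoted error.  It is only a guess at the shape of the real data, not a reproduction of the actual code:

```python
# Hypothetical data: 20 ASCII bytes followed by byte 0xa0.
raw = b"x" * 20 + b"\xa0"

codec_named = None
failed_at = None
try:
    # 0xa0 cannot start a UTF-8 sequence, so this decode fails.
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    codec_named = err.encoding   # the codec that was actually used
    failed_at = err.start        # offset of the offending byte

# The same bytes decode without complaint as ISO-8859-1.
decoded = raw.decode("iso-8859-1")

print(codec_named, failed_at, repr(decoded[-1]))
```

The exception names the codec that was in use, so a message mentioning utf8 suggests that some line is decoding as UTF-8 rather than 8859.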

-- 
DaveA

