[Tutor] trying to convert pycurl/html to ascii
Dave Angel
davea at davea.name
Mon Mar 30 04:08:18 CEST 2015
On 03/29/2015 09:49 PM, bruce wrote:
> Hi.
>
> Doing a quick/basic pycurl test on a site and trying to convert the
> returned page to pure ascii.
You cannot convert it to pure ASCII if it contains characters outside
the ASCII range. You could replace all the non-ASCII characters with
some special one, like question marks. But I doubt that's what you
really want.
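
For illustration only, here is a minimal sketch of the question-mark
replacement. The sample bytes and variable names are made up; they are
not taken from the page in question, and the sketch assumes the data
really is ISO-8859-1.

```python
# Hypothetical sample: ISO-8859-1 bytes containing 0xe9 (e-acute)
# and 0xa0 (no-break space), neither of which exists in ASCII.
raw = b"caf\xe9\xa0au lait"

# Step 1: bytes -> unicode, using the encoding the page declares.
text = raw.decode("iso-8859-1")

# Step 2: unicode -> ASCII bytes; "replace" substitutes "?" for
# every character that ASCII cannot represent.
ascii_bytes = text.encode("ascii", "replace")
print(ascii_bytes)
```

Note that this loses information: every non-ASCII character becomes
the same "?", so the original text cannot be recovered afterwards.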
>
> The page has the encoding line
>
> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
That would mean you should pass 'iso-8859-1' as the encoding name in
your decode.
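
A small sketch of what that looks like, assuming the fetched page is
held in a byte string. The sample data below is invented for the
example.

```python
# Hypothetical stand-in for the fetched page: a byte string that
# contains 0xa0, the ISO-8859-1 no-break space.
raw = b"price:\xa0100"

# Decode with the charset named in the page's meta tag.
text = raw.decode("iso-8859-1")

# The result is a unicode object; byte 0xa0 became U+00A0.
print(repr(text))
```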
>
> The test uses pycurl, and the StringIO to fetch the page into a str.
>
> pycurl stuff
> .
> .
> .
> foo=gg.getBuffer()
>
> -at this point, foo has the page in a str buffer.
>
>
> What's happening, is that the test is getting the following kind of error/
>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 20:
> invalid start byte
That's not the whole error. You need to show the whole stack trace, not
just a single line. It would also be really useful if you showed the
lines between the foo= line and the one that gets the error.
>
> The test is using python 2.6 on redhat.
>
Very good to tell us that. It makes a huge difference.
> I've tried different decode functions based on different
> sites/articles/stackoverflow but can't quite seem to resolve the issue.
>
Pick one, show us the code, and show us the full error traceback, and
somebody can help. As it stands, all I can tell you is that decode takes
a byte string and an encoding name, and produces a unicode object. And
it's not going to give you a utf-8 error if you're trying to decode 8859.
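
To make that last point concrete, here is a short sketch with made-up
sample bytes (not the actual page). It shows that byte 0xa0 triggers a
UnicodeDecodeError only when the utf-8 codec is the one being used; the
same bytes decode cleanly as ISO-8859-1, which maps every byte value to
a character.

```python
# Hypothetical data with a lone 0xa0 byte at index 3.
raw = b"abc\xa0def"

# Decoding as UTF-8 fails: 0xa0 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    bad_index = exc.start  # index of the offending byte

# Decoding as ISO-8859-1 succeeds for the same bytes.
latin1_text = raw.decode("iso-8859-1")
print(utf8_failed, repr(latin1_text))
```

So a 'utf8' codec error suggests that something in the code is
decoding as utf-8, explicitly or implicitly, rather than as 8859.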
--
DaveA