What the \xc2\xa0 ?!!

Tue Sep 7 13:15:47 EDT 2010

Brian D <briandenzer at gmail.com> writes:

> In an HTML page that I'm scraping using urllib2, a  \xc2\xa0
> bytestring appears.
>
> The page's charset = utf-8, and the Chrome browser I'm using displays
> the characters as a space.
>
> The page requires authentication:
> https://www.nolaready.info/myalertlog.php
>
> When I try to concatenate strings containing the bytestring, Python
> chokes because it refuses to coerce the bytestring into ascii.
>
> wfile.write('|'.join(valueList))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 163: ordinal not in range(128)
>
> In searching for help with this issue, I've learned that the
> bytestring *might* represent a non-breaking space.

It in fact does.
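
A quick way to check this for yourself, as a minimal sketch (the
variable names are mine, and the b'' / u'' prefixes are only there so
that it runs the same on Python 2.6+ and Python 3):

```python
import unicodedata

raw = b'\xc2\xa0'            # the two bytes found in the scraped page
text = raw.decode('utf-8')   # decodes to a single character, U+00A0

# Two bytes on the wire, one character once decoded.
print(len(raw))
print(len(text))
print(unicodedata.name(text))
```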

>
> When I scrape the page using urllib2, however, the characters print
> as  ┬á  in a Windows command prompt (though I wouldn't be surprised if
> this is some erroneous attempt by the antiquated command window to
> handle something it doesn't understand).

Yes, the console is interpreting those two bytes as two separate
characters in its own code page (cp437 or whatever) instead of as one
non-breaking space.
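
To reproduce what the command prompt shows, you can decode the same two
bytes with a console code page. This sketch assumes cp437, the usual
US default; your console may use a different one:

```python
raw = b'\xc2\xa0'

# Assumption: the console uses cp437. Each byte becomes its own character.
as_console = raw.decode('cp437')

# Decoded properly, the same bytes are a single non-breaking space.
as_utf8 = raw.decode('utf-8')

print(repr(as_console))
print(repr(as_utf8))
```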

>
> If I use IDLE to attempt to decode the single byte referenced in the
> error message, and convert it into UTF-8, another error message is
> generated:
>
>>>> weird = unicode('\xc2', 'utf-8')
>
> Traceback (most recent call last):
>   File "<pyshell#72>", line 1, in <module>
>     weird = unicode('\xc2', 'utf-8')
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0:
> unexpected end of data

Which is to be expected, as you ripped a two-byte UTF-8 sequence in half.
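
The byte 0xc2 is a lead byte that announces a two-byte sequence, and
0xa0 is its continuation byte. A small sketch of why one half alone
cannot be decoded and how the two halves combine (the bit masks follow
the standard UTF-8 layout; the variable names are mine):

```python
lead, cont = 0xC2, 0xA0

# 110xxxxx marks the start of a two-byte sequence, 10xxxxxx continues it.
is_two_byte_lead = (lead & 0xE0) == 0xC0
is_continuation = (cont & 0xC0) == 0x80

# Five payload bits from the lead byte, six from the continuation byte.
code_point = ((lead & 0x1F) << 6) | (cont & 0x3F)

# Decoding only the first half fails; decoding both halves works.
try:
    b'\xc2'.decode('utf-8')
    half_failed = False
except UnicodeDecodeError:
    half_failed = True

whole = b'\xc2\xa0'.decode('utf-8')
```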

>
> If I attempt to decode the full bytestring, I don't obtain a human-
> readable string (expecting, perhaps, a non-breaking space):

You obtain a non-breaking space. What do you think it should look like
in your terminal? It looks like ... nothing. Because it looks like a
space.

>
>>>> weird = unicode('\xc2\xa0', 'utf-8')
>>>> par = ' - '.join(['This is', weird])
>>>> par
> u'This is - \xa0'
>
> I suspect that the bytestring isn't UTF-8, but what is it? Latin1?

No, it is UTF-8.

>
>>>> weirder = unicode('\xc2\xa0', 'latin1')
>>>> weirder
> u'\xc2\xa0'
>>>> 'This just gets ' + weirder
> u'This just gets \xc2\xa0'
>
> Or is it a Microsoft bytestring?

This is not weird; this is the Python interpreter giving you the repr
of a unicode object because you evaluated it instead of printing it, so
you can see what it actually contains.

And because you wrongly decoded it as latin1, it's garbage anyway: two
characters instead of one non-breaking space.
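
To see the difference side by side, here is a minimal sketch (variable
names are mine). latin1 maps every byte to exactly one character, so it
never raises an error, which is why the wrong guess goes unnoticed:

```python
raw = b'\xc2\xa0'

wrong = raw.decode('latin-1')   # two characters, U+00C2 and U+00A0
right = raw.decode('utf-8')     # one character, U+00A0

# Evaluating a name at the prompt shows the repr; print shows the text.
print(repr(wrong))
print(repr(right))
```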

>
>>>> weirder = unicode('\xc2\xa0', 'mbcs')
>>>> 'This just gets ' + weirder
> u'This just gets \xc2\xa0'
>
> None of these codecs seem to work.

UTF-8 worked just fine.

>
> Back to the original purpose, as I'm scraping the page, I'm storing
> the field/value pair in a dictionary with each iteration through table
> elements on the page. This is all fine, until a value is found that
> contains the offending bytestring. I have attempted to coerce all
> value strings into an encoding, but Python doesn't seem to like that
> when the string is already Unicode:
>
> valuesDict[fieldString] = unicode(value, 'UTF-8')
> TypeError: decoding Unicode is not supported
>
> The solution I've arrived at is to specify the encoding for value
> strings both when reading and writing value strings.
>
> for k, v in valuesDict.iteritems():
>     valuePair = ':'.join([k, v.encode('UTF-8')])
>     [snip] ...
>     wfile.write('|'.join(valueList))
>
> I'm not sure I have a question, but does this sound familiar to any
> Unicode experts out there?
>
> How should I handle these odd bytestring values? Am I doing it
> correctly, or what could I improve?

The overall solution is to decode the page, or the parts of it you
need, using whatever encoding it is delivered in. You mentioned that
the page is delivered in UTF-8, so you should use whatever gives you
that information to decode the returned body.
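
Roughly, that means decoding once right after reading and encoding once
right before writing. The following is only a sketch under assumptions:
charset_from_content_type is a hypothetical helper, not a urllib2
function, and content_type and body are stand-ins for what your
response object would give you.

```python
import io
import re


def charset_from_content_type(header, default='utf-8'):
    """Hypothetical helper: extract charset=... from a Content-Type value."""
    match = re.search(r'charset=([\w-]+)', header or '', re.I)
    return match.group(1) if match else default


# Assumed stand-ins for the response header and the raw response body.
content_type = 'text/html; charset=utf-8'
body = b'<td>Status:\xc2\xa0open</td>'

# Decode once, on the way in.
text = body.decode(charset_from_content_type(content_type))

# Work with unicode objects only while processing.
line = u'|'.join([u'Status', text])

# Encode once, on the way out (an in-memory buffer stands in for wfile).
out = io.BytesIO()
out.write(line.encode('utf-8'))
```

With everything held as unicode in between, the join no longer has to
coerce byte strings through the ascii codec.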

Diez