What the \xc2\xa0 ?!!

Tue Sep 7 14:14:36 EDT 2010

On Sep 7, 11:01 am, Brian D <brianden... at gmail.com> wrote:
> In an HTML page that I'm scraping using urllib2, a \xc2\xa0
> bytestring appears.
>
> The page's charset = utf-8, and the Chrome browser I'm using displays
> the characters as a space.
>
> The page requires authentication: https://www.nolaready.info/myalertlog.php
>
> When I try to concatenate strings containing the bytestring, Python
> chokes because it refuses to coerce the bytestring into ascii.
>
> wfile.write('|'.join(valueList))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 163: ordinal not in range(128)
>
> In searching for help with this issue, I've learned that the
> bytestring *might* represent a non-breaking space.
>
> When I scrape the page using urllib2, however, the characters print
> as  ┬á  in a Windows command prompt (though I wouldn't be surprised if
> this is some erroneous attempt by the antiquated command window to
> handle something it doesn't understand).
>
> If I use IDLE to decode just the single byte referenced in the error
> message, treating it as UTF-8, another error message is generated:
>
> >>> weird = unicode('\xc2', 'utf-8')
>
> Traceback (most recent call last):
>   File "<pyshell#72>", line 1, in <module>
>     weird = unicode('\xc2', 'utf-8')
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0:
> unexpected end of data
>
> If I attempt to decode the full bytestring, I don't obtain a
> human-readable string (expecting, perhaps, a non-breaking space):
>
> >>> weird = unicode('\xc2\xa0', 'utf-8')
> >>> par = ' - '.join(['This is', weird])
> >>> par
>
> u'This is - \xa0'
>
> I suspect that the bytestring isn't UTF-8, but what is it? Latin1?
>
> >>> weirder = unicode('\xc2\xa0', 'latin1')
> >>> weirder
> u'\xc2\xa0'
> >>> 'This just gets ' + weirder
>
> u'This just gets \xc2\xa0'
>
> Or is it a Microsoft bytestring?
>
> >>> weirder = unicode('\xc2\xa0', 'mbcs')
> >>> 'This just gets ' + weirder
>
> u'This just gets \xc2\xa0'
>
> None of these codecs seem to work.
>
> Back to the original purpose, as I'm scraping the page, I'm storing
> the field/value pair in a dictionary with each iteration through table
> elements on the page. This is all fine, until a value is found that
> contains the offending bytestring. I have attempted to coerce all
> value strings into an encoding, but Python doesn't seem to like that
> when the string is already Unicode:
>
> valuesDict[fieldString] = unicode(value, 'UTF-8')
> TypeError: decoding Unicode is not supported
>
> The solution I've arrived at is to specify the encoding both when
> reading and when writing value strings.
>
> for k, v in valuesDict.iteritems():
>     valuePair = ':'.join([k, v.encode('UTF-8')])
>     [snip] ...
>     wfile.write('|'.join(valueList))
>
> I'm not sure I have a question, but does this sound familiar to any
> Unicode experts out there?
>
> How should I handle these odd bytestring values? Am I doing it
> correctly, or what could I improve?
>
> Thanks!

Since it's UTF-8, look at one of the pages that describe how UTF-8 is
decoded. As it turns out, the two bytes decode to the Unicode code point
U+00A0, which is indeed a non-breaking space.
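
The decoding can be checked by hand. What follows is only a sketch of the
two-byte case of UTF-8 (bit pattern 110xxxxx 10xxxxxx), not a general
decoder; the names lead, cont, code_point and decoded are made up for the
illustration.

```python
# Decode the two-byte UTF-8 sequence C2 A0 by hand.
lead, cont = 0xC2, 0xA0

# A two-byte sequence has the bit pattern 110xxxxx 10xxxxxx.
assert lead >> 5 == 0b110   # lead byte announces a two-byte sequence
assert cont >> 6 == 0b10    # second byte is a continuation byte

# Keep the low 5 bits of the lead byte and the low 6 bits of the
# continuation byte, then join them into one code point.
code_point = ((lead & 0x1F) << 6) | (cont & 0x3F)

# Cross-check against the built-in codec.
decoded = b'\xc2\xa0'.decode('utf-8')

print(hex(code_point))   # 0xa0
print(repr(decoded))
```

This also accounts for the "unexpected end of data" error quoted above:
0xC2 on its own is a lead byte that promises a continuation byte, so it
cannot be decoded in isolation.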

This is probably as good as any page: http://en.wikipedia.org/wiki/UTF-8
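
As for how to handle the values: the original error most likely comes from
joining Unicode strings with byte strings, which makes Python 2 try to
decode the bytes as ASCII. One common approach is to decode once when the
data comes in and encode once when it goes out. The following is a rough
sketch under assumptions, not the poster's actual code: raw_values and the
file name alerts.txt are invented for the example, and replacing the
non-breaking space with a plain space is optional.

```python
import io
import os
import tempfile

# Hypothetical scraped values: raw bytes as they might arrive from the page.
raw_values = [b'Status', b'Active\xc2\xa0', b'Zone 3']

# Decode once, at the input boundary, so everything in between is text.
values = [raw.decode('utf-8') for raw in raw_values]

# Optional: turn the non-breaking space (U+00A0) into an ordinary space
# and trim the result.
values = [v.replace(u'\xa0', u' ').strip() for v in values]

# Joining is safe now because every item is a Unicode string.
line = u'|'.join(values)

# Encode once, at the output boundary; io.open encodes on write.
path = os.path.join(tempfile.mkdtemp(), 'alerts.txt')
with io.open(path, 'w', encoding='utf-8') as wfile:
    wfile.write(line)
```

With this arrangement no .encode() calls are needed inside the loop, and
byte strings never get mixed with Unicode strings in the join.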

John Roth