Re: problems with Â character

Wed Mar 23 17:21:40 EST 2005

In <1111599099.854099.60300 at g14g2000cwa.googlegroups.com>, jdonnell wrote:

> Thanks for all the replies. I just got in to work so I haven't tried
> any of them yet. I see that I wasn't as clear as I should have been so
> I'll clarify a little. I'm grabbing some data from msn's rss feed.
> Here's an example.
> http://search.msn.com/results.aspx?q=domain+name&format=rss&FORM=ZZRE

Then you are getting UTF-8 encoded strings.

> The string ' all domain name extensions     » Good' is where I have a
> problem. The
> '    »' shows up as  'Â  Â  Â»' when I write it to a file or stick
> it in mysql. I did a hex dump and this is what I see.
> 
> jay at localhost:~/scripts> cat test.txt
> extensions     » Good
> jay at localhost:~/scripts> xxd test.txt
> 0000000: 6578 7465 6e73 696f 6e73 20c2 a020 c2a0  extensions .. ..
> 0000010: 20c2 bb20 476f 6f64 0a                    .. Good
> 
> One thing that jumps out is that two of the Â's are c2a0, but one of
> them is c2bb. Well, those are the details since I wasn't clear before.

Those are two no-break spaces and a '»' character::

  In [42]: import unicodedata

  In [43]: unicodedata.name('\xc2\xa0'.decode('utf-8'))
  Out[43]: 'NO-BREAK SPACE'

  In [44]: unicodedata.name('\xc2\xbb'.decode('utf-8'))
  Out[44]: 'RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK'
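
The 'Â' itself is most likely what you get when those UTF-8 bytes are later read as Latin-1: the lead byte 0xC2 is 'Â' in Latin-1. That some step in your pipeline does this is an assumption on my part, as nothing in the dump says which one. A minimal sketch in Python 3 syntax, using the bytes from your xxd output (the names `raw`, `text` and `wrong` are only for illustration):

```python
# Bytes exactly as shown in the xxd dump of test.txt.
raw = b"extensions \xc2\xa0 \xc2\xa0 \xc2\xbb Good\n"

# Correct interpretation: decode the bytes as UTF-8.
text = raw.decode("utf-8")

# Assumed failure mode: the same bytes read as Latin-1.
# Every 0xC2 lead byte turns into a visible 'Â'.
wrong = raw.decode("latin-1")

print(repr(text))
print(repr(wrong))

# The damage is reversible as long as no bytes were dropped along the way.
repaired = wrong.encode("latin-1").decode("utf-8")
print(repaired == text)
```

If that is what happens, the fix is to decode the feed as UTF-8 once and then encode explicitly in whatever encoding the file or the database connection expects.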

Ciao,
	Marc 'BlackJack' Rintsch