Re: problems with Â character
Marc 'BlackJack' Rintsch
bj_666 at gmx.net
Wed Mar 23 17:21:40 EST 2005
In <1111599099.854099.60300 at g14g2000cwa.googlegroups.com>, jdonnell wrote:
> Thanks for all the replies. I just got in to work so I haven't tried
> any of them yet. I see that I wasn't as clear as I should have been so
> I'll clarify a little. I'm grabbing some data from msn's rss feed.
> Here's an example.
> http://search.msn.com/results.aspx?q=domain+name&format=rss&FORM=ZZRE
Then you are getting UTF-8 encoded strings.
> The string ' all domain name extensions » Good' is where I have a
> problem. The ' »' shows up as ' Â  Â  Â»' when I write it to a file or stick
> it in mysql. I did a hex dump and this is what I see.
>
> jay at localhost:~/scripts> cat test.txt
> extensions » Good
> jay at localhost:~/scripts> xxd test.txt
> 0000000: 6578 7465 6e73 696f 6e73 20c2 a020 c2a0 extensions .. ..
> 0000010: 20c2 bb20 476f 6f64 0a .. Good
>
> One thing that jumps out is that two of the Â's are c2a0, but one of
> them is c2bb. Well, those are the details since I wasn't clear before.
Those are two no-break spaces and a '»' character::

    In [42]: import unicodedata

    In [43]: unicodedata.name('\xc2\xa0'.decode('utf-8'))
    Out[43]: 'NO-BREAK SPACE'

    In [44]: unicodedata.name('\xc2\xbb'.decode('utf-8'))
    Out[44]: 'RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK'

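
For anyone on Python 3, where `str` no longer has a `.decode()` method, here is a minimal sketch of the same check. The bytes are copied from the hex dump quoted above; the variable names (`raw`, `text`, `names`, `mojibake`) are only illustrative. It also shows where the stray 'Â' probably comes from: the UTF-8 bytes being read as Latin-1.

```python
import unicodedata

# Bytes copied from the xxd dump in the quoted message.
raw = b"extensions \xc2\xa0 \xc2\xa0 \xc2\xbb Good\n"

# Decoding as UTF-8 gives the intended characters.
text = raw.decode("utf-8")
names = [unicodedata.name(ch) for ch in text if ord(ch) > 127]
print(names)

# Decoding the same bytes as Latin-1 turns every 0xC2 lead byte
# into a separate 'Â' character, which is the symptom described.
mojibake = raw.decode("latin-1")
print(repr(mojibake))

# Encoding the correctly decoded text back to UTF-8 round-trips.
assert text.encode("utf-8") == raw
```

So the data itself looks fine. Whatever displays or stores it just has to be told that it is UTF-8.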
Ciao,
Marc 'BlackJack' Rintsch