Normalize a polish L

Rob Wolfe rw at smsnet.pl
Mon Oct 15 16:00:23 EDT 2007


Peter Bengtsson <peterbe at gmail.com> writes:

> In UTF8, \u0141 is a capital L with a little dash through it as can be
> seen in this image:
> http://static.peterbe.com/lukasz.png
>
> I tried this:
>>>> import unicodedata
>>>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
> ''
>
> I was hoping it would convert it to 'L' because that's what it
> visually looks like. And I've seen it become a normal ASCII L before
> in other programs such as Thunderbird.
>
> I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
> none of them helped.
>
> What am I doing wrong?

I had the same problem, and a little research showed that it is caused
by the Unicode standard itself. I don't know why, but characters with a
stroke have no canonical decomposition, so no normalization form can
split them into a base letter plus a combining mark.
I looked into this file:
http://unicode.org/Public/UNIDATA/UnicodeData.txt

and compared two sets of entries:

1.
<UnicodeData.txt>
0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH \
;;0141;;0141
0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH \
;;;0142;
</UnicodeData.txt> 

2.
<UnicodeData.txt>
0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK \
;;0104;;0104
</UnicodeData.txt>

In item 2 the sixth field contains the canonical decomposition
(0061 0328, i.e. 'a' followed by a combining ogonek), but in item 1
that field is empty. I don't know what the justification for that is,
but there probably is one. ;)
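
The same thing can be checked from Python, and one possible workaround
is to map the stroke letters by hand before normalizing. What follows is
only a rough sketch in Python 3 syntax: the STROKE_FALLBACK table and
the to_ascii name are made up for illustration (they are not part of
unicodedata), and the table covers only the two L characters discussed
here.

```python
import unicodedata

# Hypothetical fallback table for letters that have no decomposition.
# Only the two characters from this thread are listed.
STROKE_FALLBACK = {
    '\u0141': 'L',  # LATIN CAPITAL LETTER L WITH STROKE
    '\u0142': 'l',  # LATIN SMALL LETTER L WITH STROKE
}

def to_ascii(text):
    """Best-effort ASCII folding (illustrative helper, not a stdlib function)."""
    # Replace stroke letters first, since normalization leaves them alone.
    text = ''.join(STROKE_FALLBACK.get(ch, ch) for ch in text)
    # NFKD splits e.g. a-ogonek into 'a' + combining ogonek; the
    # combining mark is then dropped by the ASCII encoder.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# The decomposition field is empty for L with stroke ...
print(repr(unicodedata.decomposition('\u0141')))  # ''
# ... but not for a with ogonek.
print(repr(unicodedata.decomposition('\u0105')))  # '0061 0328'

print(to_ascii('\u0141ukasz'))  # Lukasz
```

Other characters without a decomposition would need their own entries in
the table.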


Regards,
Rob




