unicode bug in turkish characters?
Oktay Safak
oktaysafak at ixir.com
Wed Apr 2 02:43:38 EST 2003
Hi everybody,
I think there is a problem with the Turkish encoding in Python's
unicode support. I have Python 2.3a installed on Win98/ME. I came
across some strange behaviour while using the re module, and upon
investigating the issue I have narrowed it down to the following
(I call sys.setdefaultencoding("iso-8859-9") in sitecustomize.py
and use IDLE):
When I try to convert the character "i" to uppercase, what comes
out is "I", where it should be the capital I with a dot on top instead.
Also, when I try to convert the uppercase I with dot to lowercase, it
comes out as itself, where "i" should be the character produced.
I'm unable to paste the IDLE window output because Turkish characters
are involved, which I cannot reproduce here. I did some investigation
with both the iso-8859-9 and windows-1254 encodings and found that the
ordinal values before and after toggling the case of the characters
are as follows:
                   | original value    | after case toggle |
                   | 8859-9 | win-1254 | 8859-9 | win-1254 |
-------------------+--------+----------+--------+----------+
small i w/o dot    | 253    | 253      | 73     | 253      |
small i            | 105    | 105      | 73     | 73       |
Capital I          | 73     | 73       | 105    | 105      |
Capital I with dot | 221    | 221      | 221    | 221      |
THE CORRECT VALUES SHOULD INSTEAD BE:
                   | original value    | after case toggle |
                   | 8859-9 | win-1254 | 8859-9 | win-1254 |
-------------------+--------+----------+--------+----------+
small i w/o dot    | 253    | 253      | 73     | 73*      |
small i            | 105    | 105      | 221*   | 221*     |
Capital I          | 73     | 73       | 253*   | 253*     |
Capital I with dot | 221    | 221      | 105*   | 105*     |
As you can see, only one of the 8 conversions is correct (the values
marked * differ from what I actually get).
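To make the expected values concrete, here is a minimal sketch of a case toggle that treats the four i/I characters the Turkish way. It assumes a Python 3 interpreter (not the 2.3a I am running), and the names TURKISH_SWAP, turkish_swapcase and toggled_ordinal are made up for this illustration; they are not part of Python.

```python
# Hypothetical helpers for illustration only; assumes Python 3.
# Turkish pairs: dotted i <-> dotted capital I, dotless i <-> plain capital I.
TURKISH_SWAP = {
    "i": "\u0130",       # small i            -> capital I with dot above
    "\u0130": "i",       # capital I with dot -> small i
    "I": "\u0131",       # capital I          -> small dotless i
    "\u0131": "I",       # small dotless i    -> capital I
}


def turkish_swapcase(text):
    """Toggle case, overriding the default mapping for the four i/I letters."""
    return "".join(TURKISH_SWAP.get(ch, ch.swapcase()) for ch in text)


def toggled_ordinal(byte_value, encoding):
    """Decode one byte, toggle its case, and return the re-encoded byte value."""
    ch = bytes([byte_value]).decode(encoding)
    return turkish_swapcase(ch).encode(encoding)[0]


for name in ("iso-8859-9", "cp1254"):
    # Same row order as the tables: dotless i, small i, capital I, dotted capital I
    print(name, [toggled_ordinal(b, name) for b in (253, 105, 73, 221)])
```

With both encodings this prints the ordinals 73, 221, 253 and 105, i.e. the "after case toggle" columns of the second table.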
I have looked into the encodings directory and saw that the
iso-8859-9 file contains:
decoding_map.update({
    0x00d0: 0x011e,  # LATIN CAPITAL LETTER G WITH BREVE
    0x00dd: 0x0130,  # LATIN CAPITAL LETTER I WITH DOT ABOVE ***1
    0x00de: 0x015e,  # LATIN CAPITAL LETTER S WITH CEDILLA
    0x00f0: 0x011f,  # LATIN SMALL LETTER G WITH BREVE
    0x00fd: 0x0131,  # LATIN SMALL LETTER DOTLESS I ***2
    0x00fe: 0x015f,  # LATIN SMALL LETTER S WITH CEDILLA
})
The lines marked *** are the relevant ones here. In decimal, the
mappings for these are:
221:304
and
253:305
respectively. So why aren't these the values I get?
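The decimal values can be checked directly against the codec. This is only a sketch, again assuming Python 3; decoded_code_point is a hypothetical helper name. It decodes each byte with iso-8859-9 and reports the resulting Unicode code point and its official name.

```python
import unicodedata


def decoded_code_point(byte_value, encoding="iso-8859-9"):
    """Return the Unicode code point that a single byte decodes to."""
    return ord(bytes([byte_value]).decode(encoding))


for byte_value in (221, 253):
    code_point = decoded_code_point(byte_value)
    # Shows the byte, its code point in decimal and hex, and the character name
    print(byte_value, code_point, hex(code_point),
          unicodedata.name(chr(code_point)))

# Encoding the code points back gives the original byte values again.
assert "\u0130".encode("iso-8859-9")[0] == 221
assert "\u0131".encode("iso-8859-9")[0] == 253
```

Note that 304 and 305 are Unicode code points, not byte values: once a character is encoded back to iso-8859-9, its ordinal is again 221 or 253.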
I might be missing something, but I have a feeling that this is
a bug, since the case toggle works perfectly with the Turkish
characters that do not exist in ASCII. With i and I though, which
do exist in ASCII, it's all messed up.
Any ideas?