unicode bug in turkish characters?

Oktay Safak oktaysafak at ixir.com
Wed Apr 2 02:43:38 EST 2003


Hi everybody,

I think there is a problem with the Turkish encoding in Python's
unicode support. I have Python 2.3a installed on Win98/ME. I came
across some strange behaviour while using the re module, and upon
investigating the issue I have narrowed it down to the following
(I call sys.setdefaultencoding("iso-8859-9") in sitecustomize.py
and use IDLE):

When I try to convert the character "i" to uppercase, what comes
out is "I", where it should be the capital I with a dot on top. Also,
when I try to convert the uppercase I with dot to lowercase, it
comes out unchanged, where "i" should be the character produced.
I'm unable to give IDLE window output because Turkish characters
are involved, which I cannot reproduce here. I did some investigation
with both the iso-8859-9 and windows-1254 encodings and found that
the ordinal values before and after toggling the case of the
characters are as follows:


                   |     original value     |   after case toggle  |
                   -------------------------------------------------
                   |    8859-9 | win-1254   |   8859-9 | win-1254  |
--------------------------------------------------------------------
small i w/o dot    |    253    |    253     |    73    |     253   |
--------------------------------------------------------------------
small i            |    105    |    105     |    73    |     73    |
--------------------------------------------------------------------
Capital I          |     73    |     73     |    105   |     105   |
--------------------------------------------------------------------
Capital I with dot |     221   |     221    |    221   |     221   |
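
A minimal sketch of how ordinals like the ones above could be collected,
assuming a current Python 3 interpreter rather than the 2.3 alpha used for
the table, so the numbers it prints may not match the table. The helper
name toggle_byte is made up for illustration. It decodes one byte, swaps
the case of the resulting Unicode character, and tries to encode it back:

```python
def toggle_byte(value, encoding):
    """Decode one byte, swap its case as Unicode, and encode it back.

    Returns the resulting byte values as a list, or None when the
    toggled text cannot be represented in the given encoding.
    """
    char = bytes([value]).decode(encoding)
    toggled = char.swapcase()
    try:
        return list(toggled.encode(encoding))
    except UnicodeEncodeError:
        return None


# 253 = small dotless i, 105 = small i, 73 = capital I, 221 = capital I with dot
for encoding in ("iso-8859-9", "cp1254"):
    for value in (253, 105, 73, 221):
        print(encoding, value, toggle_byte(value, encoding))
```

The generic swapcase() is locale-independent, so it is not expected to
apply the Turkish rules for i and I.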


THE CORRECT VALUES SHOULD INSTEAD BE:

                   |     original value     |   after case toggle  |
                   -------------------------------------------------
                   |    8859-9 | win-1254   |   8859-9 | win-1254  |
--------------------------------------------------------------------
small i w/o dot    |    253    |    253     |    73    |     73*   |
--------------------------------------------------------------------
small i            |    105    |    105     |    221*  |     221*  |
--------------------------------------------------------------------
Capital I          |     73    |     73     |    253*  |     253*  |
--------------------------------------------------------------------
Capital I with dot |     221   |     221    |    105*  |     105*  |
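
The expected values in this table can be written down as an explicit
lookup. This is only a sketch under the assumption of a current Python 3
interpreter; TURKISH_TOGGLE, turkish_swapcase and toggled_ordinal are
hypothetical names, not part of Python or its codecs:

```python
# Hypothetical lookup table for the four Turkish i/I letters.
TURKISH_TOGGLE = {
    "\u0131": "I",       # small dotless i          -> capital I
    "i": "\u0130",       # small i                  -> capital I with dot above
    "I": "\u0131",       # capital I                -> small dotless i
    "\u0130": "i",       # capital I with dot above -> small i
}


def turkish_swapcase(text):
    """Swap case, applying the Turkish i/I pairs before the generic rule."""
    return "".join(TURKISH_TOGGLE.get(ch) or ch.swapcase() for ch in text)


def toggled_ordinal(value, encoding="iso-8859-9"):
    """Return the byte value of a single-byte character after the toggle."""
    char = bytes([value]).decode(encoding)
    return turkish_swapcase(char).encode(encoding)[0]


for value in (253, 105, 73, 221):
    print(value, toggled_ordinal(value), toggled_ordinal(value, "cp1254"))
```

All other characters fall through to the ordinary swapcase(), so only the
four letters in the table are treated specially.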


As you can see, only one of the 8 conversions is correct.

I have looked into the encodings directory and saw that the
iso-8859-9 file contains:

decoding_map.update({
        0x00d0: 0x011e, #  LATIN CAPITAL LETTER G WITH BREVE
        0x00dd: 0x0130, #  LATIN CAPITAL LETTER I WITH DOT ABOVE  ***1
        0x00de: 0x015e, #  LATIN CAPITAL LETTER S WITH CEDILLA
        0x00f0: 0x011f, #  LATIN SMALL LETTER G WITH BREVE
        0x00fd: 0x0131, #  LATIN SMALL LETTER DOTLESS I           ***2
        0x00fe: 0x015f, #  LATIN SMALL LETTER S WITH CEDILLA
})


The lines marked *** are the relevant ones here. In decimal, the
mappings for these seem to be:

 221:304

and

 253:305

respectively. So why aren't these the values I get?
I might be missing something, but I have a feeling that this is
a bug, since the case toggle works perfectly with Turkish
characters that do not exist in ASCII. With i and I though, which
do exist in ASCII, it's all messed up.
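
The two mappings quoted above can be checked without reading the codec
source. A minimal sketch, assuming a current Python 3 interpreter; the
helper name decoded_ordinal is made up for illustration:

```python
def decoded_ordinal(value, encoding):
    """Return the Unicode code point that a single byte decodes to."""
    return ord(bytes([value]).decode(encoding))


for encoding in ("iso-8859-9", "cp1254"):
    for value in (221, 253):
        code_point = decoded_ordinal(value, encoding)
        print(encoding, value, code_point, hex(code_point))
```

This only confirms the byte-to-code-point mapping of the codec. Case
conversion is a separate step applied to the decoded character.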

Any ideas?





