Turkic I and re

Yaşar Arabacı yasar11732 at gmail.com
Thu Sep 15 11:04:27 EDT 2011


Hi,

I am a Turkish self-taught Python user. Personally, I don't think I am in a
position to discuss an issue of this scale. But in my opinion, the Pardus*
developers should be invited to join this discussion. As they use Python
heavily in most of their projects**, I think they would have something
valuable to say about this subject. Here is the pardus-developers
mailing list: http://liste.pardus.org.tr/mailman/listinfo/pardus-devel

As for me, I always assume the Turkish locale might cause problems, and use
workarounds where necessary. For example, if I needed to match a lower-case
or upper-case Turkish "i", I would probably go with [iİ] and the Unicode flag.
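
A minimal sketch of that workaround, assuming Python 3 and NFC-normalised
text (so that "İ" is the single code point U+0130). The names DOTTED_I,
DOTLESS_I, ilce and isik are only illustrative; they do not come from any
library. The idea is to spell out both cases of each letter by hand rather
than rely on IGNORECASE or the current locale:

```python
import re

# Explicit character classes for the two Turkish letter pairs.
DOTTED_I = "[iİ]"    # i (U+0069) and İ (U+0130)
DOTLESS_I = "[ıI]"   # ı (U+0131) and I (U+0049)

# re.UNICODE is already the default for str patterns in Python 3; it is
# passed here only to mirror the "unicode flag" mentioned above. In
# Python 2 both the pattern and the text would have to be unicode objects.
ilce = re.compile(DOTTED_I + "lçe", re.UNICODE)    # "ilçe" or "İlçe"
isik = re.compile(DOTLESS_I + "şık", re.UNICODE)   # "ışık" or "Işık"

for word in ("ilçe", "İlçe", "Ilçe"):
    print(word, bool(ilce.match(word)))

for word in ("ışık", "Işık", "işık"):
    print(word, bool(isik.match(word)))
```

Only the first letter is handled here; a real pattern would need the same
treatment for every "i" or "ı" it contains.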


*) a Linux distro developed by the Scientific & Technological Research
Council of Turkey
**) http://developer.pardus.org.tr/projects/index.html



2011/9/15 MRAB <python at mrabarnett.plus.com>

> On 15/09/2011 14:44, John-John Tedro wrote:
>
>> On Thu, Sep 15, 2011 at 1:16 PM, Alan Plum <me at alanplum.com>
>> wrote:
>>
>>    On 2011-09-15 15:02, MRAB wrote:
>>
>>        The regex module at http://pypi.python.org/pypi/regex
>>        currently uses a
>>        compromise, where it matches 'I' with 'i' and also 'I' with 'ı'
>>        and 'İ'
>>        with 'i'.
>>
>>        I was wondering if it would be preferable to have a TURKIC flag
>>        instead
>>        ("(?T)" or "(?T:...)" in the pattern).
>>
>>
>>    I think the problem many people ignore when coming up with solutions
>>    like this is that while this behaviour is pretty much unique for
>>    Turkish script, there is no guarantee that Turkish substrings won't
>>    appear in other language strings (or vice versa).
>>
>>    For example, foreign names in Turkish are often given as spelled in
>>    their native (non-Turkish) script variants. Likewise, Turkish names
>>    in other languages are often given as spelled in Turkish.
>>
>>    The Turkish 'I' is a peculiarity that will probably haunt us
>>    programmers until hell freezes over. Unless Turkey abandons its
>>    traditional orthography or people start speaking only a single
>>    language at a time (including names), there's no easy way to deal
>>    with this.
>>
>>    In other words: the only way to make use of your proposed flag is if
>>    you have a fully language-tagged input (e.g. an XML document making
>>    extensive use of xml:lang) and only ever apply regular expressions
>>    to substrings containing one culture at a time.
>>
>>    --
>>    http://mail.python.org/mailman/listinfo/python-list
>>
>>
>> Python does not appear to support special case mappings; in effect, it
>> is not 100% compliant with the Unicode standard.
>>
>> The locale-specific 'i' casing in Turkic is mentioned in 5.18 (Case
>> Mappings) of the Unicode standard:
>> http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf#G21180
>>
>> AFAIK, the case methods of Python strings seem to be built around the
>> assumption that len("string") == len("string".upper()), but some of
>> these casing rules require that the string grow, like uppercasing of the
>> German sharp s "ß", which should be translated to the expanded string "SS".
>> These special cases should be triggered on specific locales, but I have
>> not been able to verify that the Turkic uppercasing of "i" works on
>> either Python 2.6, 2.7 or 3.1:
>>
>>   import locale
>>   # warning: requires a Turkish locale on your system
>>   locale.setlocale(locale.LC_ALL, "tr_TR.utf8")
>>   ord("i".upper()) == 0x130 # is False for me, but should be True
>>
>> I wouldn't be surprised if these issues are translated into the 're'
>> module.
>>
>
> There has been some discussion on the Python-dev list about improving
> Unicode support in Python 3.
>
> It's somewhat unlikely that Unicode will become locale-dependent in
> Python because it would cause problems; you don't want:
>
>    "i".upper() == "I"
>
> to be maybe true, maybe false.
>
> An option would be to specify whether it should be locale-dependent.
>
>
>> The only support appears to be the 'L' switch, but it only makes "\w, \W,
>> \b, \B, \s and \S dependent on the current locale".
>>
>
> That flag is for locale-dependent 8-bit encodings. The ASCII (Python
> 3), LOCALE and UNICODE flags are mutually exclusive.
>
>
>> Which probably does not yield to the special rules mentioned above, but
>> I could be wrong. Make sure that your locale is correct and test again.
>>
>> If you are unsuccessful, I don't see a 'Turkic flag' being introduced
>> into re module any time soon, given the following from PEP 20
>> "Special cases aren't special enough to break the rules"
>>
>
> That's why I'm interested in the view of Turkish users. The rest of us
> will probably never have to worry about it! :-)
>
> (There's a report in the Python bug tracker about this issue, which is
> why the regex module has the compromise.)



-- 
http://yasar.serveblog.net/

