Turkic I and re

Thu Sep 15 09:44:54 EDT 2011

On Thu, Sep 15, 2011 at 1:16 PM, Alan Plum <me at alanplum.com> wrote:

> On 2011-09-15 15:02, MRAB wrote:
>
>> The regex module at http://pypi.python.org/pypi/regex currently uses a
>> compromise, where it matches 'I' with 'i' and also 'I' with 'ı' and 'İ'
>> with 'i'.
>>
>> I was wondering if it would be preferable to have a TURKIC flag instead
>> ("(?T)" or "(?T:...)" in the pattern).
>>
>
> I think the problem many people ignore when coming up with solutions like
> this is that while this behaviour is pretty much unique for Turkish script,
> there is no guarantee that Turkish substrings won't appear in other language
> strings (or vice versa).
>
> For example, foreign names in Turkish are often given as spelled in their
> native (non-Turkish) script variants. Likewise, Turkish names in other
> languages are often given as spelled in Turkish.
>
> The Turkish 'I' is a peculiarity that will probably haunt us programmers
> until hell freezes over. Unless Turkey abandons its traditional orthography
> or people start speaking only a single language at a time (including names),
> there's no easy way to deal with this.
>
> In other words: the only way to make use of your proposed flag is if you
> have a fully language-tagged input (e.g. an XML document making extensive
> use of xml:lang) and only ever apply regular expressions to substrings
> containing one culture at a time.
>

Python does not appear to support special case mappings; in effect, it is
not 100% compliant with the Unicode standard.

The locale-specific 'i' casing in Turkic is mentioned in section 5.18 (Case
Mappings) of the Unicode standard:
http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf#G21180

AFAIK, the case methods of Python strings seem to be built around the
assumption that len("string") == len("string".upper()), but some of these
casing rules require that the string grow, like uppercasing the German
sharp s "ß", which should be translated to the expanded string "SS".
The Turkic special cases should only be triggered in specific locales, but I
have not been able to verify that the Turkic uppercasing of "i" works on
Python 2.6, 2.7 or 3.1:

  import locale
  # warning: requires the Turkish locale to be installed on your system
  locale.setlocale(locale.LC_ALL, "tr_TR.utf8")
  ord("i".upper()) == 0x130 # is False for me, but should be True
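
For comparison, here is a minimal sketch of what Turkic-aware casing could
look like if done by hand, without relying on the locale at all. The helper
names turkic_upper and turkic_lower are made up for illustration and are not
part of Python or the regex module; the sketch assumes Python 3 and only
covers the dotted/dotless i pairs.

```python
# Hypothetical helpers, not part of any library: apply the Turkic i
# mappings first, then let the default case methods handle the rest.

DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı


def turkic_upper(text):
    """Uppercase with i -> İ and ı -> I (Turkic rules)."""
    text = text.replace("i", DOTTED_CAPITAL_I)
    text = text.replace(DOTLESS_SMALL_I, "I")
    return text.upper()


def turkic_lower(text):
    """Lowercase with İ -> i and I -> ı (Turkic rules)."""
    # Replace before calling lower() so the default mapping never
    # sees the capital letters it would treat differently.
    text = text.replace(DOTTED_CAPITAL_I, "i")
    text = text.replace("I", DOTLESS_SMALL_I)
    return text.lower()


is_dotted = ord(turkic_upper("i")) == 0x130
```

With these helpers the check above comes out True, but only because the
caller has decided up front that the whole string is Turkic.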

I wouldn't be surprised if these issues carry over into the 're' module.

The only support appears to be the 'L' (re.LOCALE) switch, but it only makes
"\w, \W, \b, \B, \s and \S dependent on the current locale".
That probably does not cover the special rules mentioned above, but I
could be wrong. Make sure that your locale is correct and test again.
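
If the locale route does not work, one workaround that needs no flag is to
spell out the i variants in the pattern itself. The sketch below is only an
illustration of that idea: loose_i_pattern is a made-up helper name, it
assumes Python 3, and it is deliberately looser than the compromise MRAB
describes, since it lets all four letters match each other.

```python
import re

# The four letters involved: i, dotless ı, I and dotted İ.
I_VARIANTS = "i\u0131I\u0130"


def loose_i_pattern(literal):
    """Hypothetical helper: compile a regex for `literal` in which every
    i-like letter matches any of the four variants. All other characters
    are matched literally (and case-sensitively)."""
    parts = []
    for ch in literal:
        if ch in I_VARIANTS:
            parts.append("[" + I_VARIANTS + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


pattern = loose_i_pattern("Istanbul")
candidates = ["Istanbul", "\u0130stanbul", "\u0131stanbul", "istanbul"]
matches = [bool(pattern.match(s)) for s in candidates]
```

This sidesteps the question of which language a substring is in, at the
cost of accepting spellings that are wrong in every language.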

If you are unsuccessful, I don't see a 'Turkic flag' being introduced into
the re module any time soon, given the following from PEP 20:
"Special cases aren't special enough to break the rules"

Cheers,
-- John-John Tedro