RE + UTF-8

cepl@surfbest.net ceplma at gmail.com
Sat Sep 24 19:48:11 EDT 2005


I am working on an extension of the genericwiki.py plugin for PyBlosxom
and I have problems with UTF-8 and RE. Given this wiki line, it cuts the
URL off too early:

[http://en.wikipedia.org/wiki/Petr_Chelcický Petr Chelcický's]
work(s) into English.

and creates

[<a
href="http://en.wikipedia.org/wiki/Petr_Chel">http://en.wikipedia.org/wiki/Petr_Chel</a>cický
Petr Chelcický's]

The RE genericwiki uses for parsing this:

# WikiName pattern used in your wiki
wikinamepattern = r'\b(([A-Z]\w+){2,})\b' # original
mailurlpattern = r'mailto\:[\"\-\_\.\w]+\@[\-\_\.\w]+\w'
newsurlpattern = r'news\:(?:\w+\.){1,}\w+'
fileurlpattern = r'(?:http|https|file|ftp):[/-_.\w-]+[\/\w][?&+=%\w/-_.#]*'

[...]

    # Turn '[xxx:address label]' into labeled link
    body = re.sub(r'\[(' +
           fileurlpattern + '|' +
           mailurlpattern + '|' +
           newsurlpattern + ')\s+(.+?)\]',
           r'<a href="\1">\2</a>', body,re.U)
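
One detail that may matter in the call above: the fourth positional
argument of re.sub() is count, not flags, so re.U there is not applied as
a flag. A minimal sketch of the same substitution with the flag given to
re.compile() instead, assuming the entry text has already been decoded to
a unicode string (link_labeled and labeled_link are hypothetical names,
not part of genericwiki.py):

```python
import re

# Patterns copied from the post above.
mailurlpattern = r'mailto\:[\"\-\_\.\w]+\@[\-\_\.\w]+\w'
newsurlpattern = r'news\:(?:\w+\.){1,}\w+'
fileurlpattern = r'(?:http|https|file|ftp):[/-_.\w-]+[\/\w][?&+=%\w/-_.#]*'

# Compile once; the second argument of re.compile() really is the flags.
labeled_link = re.compile(
    r'\[(' + fileurlpattern + '|' + mailurlpattern + '|' + newsurlpattern
    + r')\s+(.+?)\]',
    re.UNICODE)


def link_labeled(body):
    """Turn '[xxx:address label]' into a labeled link (hypothetical helper)."""
    return labeled_link.sub(r'<a href="\1">\2</a>', body)


# Assumption: the input is a unicode string, not raw UTF-8 bytes.
# u"\xfd" is the letter y with acute accent.
line = u"[http://en.wikipedia.org/wiki/Petr_Chelcick\xfd Petr Chelcick\xfd's]"
result = link_labeled(line)
print(result)
```

With a unicode input the accented letter counts as a word character, so
the URL should be kept whole up to the space before the label.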

I have tried to test RE and UTF-8 in Python generally and the results
are even more confusing (done with locale cs_CZ.UTF-8 in konsole):

>>> locale.getpreferredencoding()
'UTF-8'
>>> print re.sub("(\w*)","X","[Chelcický]",re.L)
X[X?Xý]
>>> print re.sub("(\w*)","X","[Chelcický]",re.UNICODE)
X[X?X?X]X
>>>

I would expect that both print commands should give just plain X, but
apparently Python doesn't understand that. What's the problem?
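
For comparison, a small sketch of the same kind of test run once on a
unicode string and once on its UTF-8 bytes. It rests on a few
assumptions: it uses \w+ rather than \w* to keep empty matches out of the
picture, it passes the flag to re.compile() rather than as the fourth
argument of re.sub() (which is count), and the names text, raw, word and
byte_word are only illustrative. The brackets are not word characters,
so they stay in the output either way:

```python
import re

text = u"[Chelcick\xfd]"        # unicode string; \xfd is y with acute accent
raw = text.encode("utf-8")      # the same text as UTF-8 encoded bytes

# Unicode pattern on a unicode string: the accented letter is a word char.
word = re.compile(r"\w+", re.UNICODE)
# Byte pattern on bytes: \w only knows ASCII letters, digits and underscore,
# so the two bytes that encode the accented letter are left alone.
byte_word = re.compile(br"\w+")

as_text = word.sub(u"X", text)
as_bytes = byte_word.sub(b"X", raw)

print(as_text)
print(repr(as_bytes))
```

On the unicode string the whole word collapses into a single X between
the brackets, while on the byte string the trailing two bytes of the
encoded letter remain after the X.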

Thanks for any reply,

Matej



