Using re to find unicode ranges
Mark Tolonen
M8R-yfto6h at mailinator.com
Mon Sep 29 11:03:55 EDT 2008
"Eric Abrahamsen" <eric at ericabrahamsen.net> wrote in message
news:mailman.1674.1222694261.3487.python-list at python.org...
> Is it possible to use the re module to find runs of characters within a
> certain Unicode range?
>
> I'm writing a Markdown extension to go over text and wrap blocks of
> consecutive Chinese characters in <span class="char"></span> tags for
> nice styling in an HTML page. The available hooks appear to be a pre-
> processor (which is a "for line in lines" situation) or an inline pattern
> (which uses regular expressions). The regular expression solution would
> be much simpler and faster, but something tells me there's no way to use
> a regex to find character ranges... Chinese characters appear to fall
> between 19968 and 40959 using ord(), and I suppose I can go that route if
> necessary, but I think it would be ugly.
# coding: utf-8
import re
sample = u'My name is 马克. I am 美国人.'
for n in re.findall(ur'[\u4e00-\u9fff]+', sample):
    print n
output:
马克
美国人
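
For the span-wrapping described in the question, the same character class can be passed to re.sub instead of re.findall. What follows is only a rough sketch, not the Markdown extension itself. It assumes Python 3, where the ur'' prefix no longer exists and the re module interprets \uXXXX escapes in a plain raw string. It also assumes "Chinese characters" means just the U+4E00-U+9FFF block (19968 to 40959, the range given in the question). The names CJK_RUN and wrap_cjk are made up for illustration.

```python
import re

# Assumption: only the CJK Unified Ideographs block, U+4E00..U+9FFF
# (19968..40959 with ord()), counts as "Chinese characters" here.
# CJK_RUN and wrap_cjk are hypothetical names, not part of any library.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')


def wrap_cjk(text):
    """Wrap each run of consecutive CJK ideographs in a span tag."""
    # \g<0> in the replacement stands for the whole matched run.
    return CJK_RUN.sub(r'<span class="char">\g<0></span>', text)


sample = 'My name is 马克. I am 美国人.'
print(wrap_cjk(sample))
```

Text without any characters in the range passes through unchanged, and the pattern string itself could be handed to an inline-pattern hook if the extension API accepts a regular expression.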
--Mark