Using re to find unicode ranges

Mark Tolonen M8R-yfto6h at mailinator.com
Mon Sep 29 11:03:55 EDT 2008


"Eric Abrahamsen" <eric at ericabrahamsen.net> wrote in message 
news:mailman.1674.1222694261.3487.python-list at python.org...
> Is it possible to use the re module to find runs of characters within a
> certain Unicode range?
>
> I'm writing a Markdown extension to go over text and wrap blocks of
> consecutive Chinese characters in <span class="char"></span> tags for
> nice styling in an HTML page. The available hooks appear to be a
> pre-processor (which is a "for line in lines" situation) or an inline
> pattern (which uses regular expressions). The regular expression solution
> would be much simpler and faster, but something tells me there's no way to
> use a regex to find character ranges... Chinese characters appear to fall
> between 19968 and 40959 using ord(), and I suppose I can go that route if
> necessary, but I think it would be ugly.

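# coding: utf-8
# Python 2 syntax (u''/ur'' literals and the print statement).
import re
sample = u'My name is 马克. I am 美国人.'
# [\u4e00-\u9fff] is the 19968-40959 range from above, written as code points.
for n in re.findall(ur'[\u4e00-\u9fff]+', sample):
    print n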

output:

马克
美国人
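
For the span-wrapping use case in the question, the same character class can be passed to re.sub. The following is only a minimal sketch, assuming Python 3 (where plain string literals are Unicode and re understands \uXXXX escapes in patterns). The names CJK_RUN and wrap_cjk are made up for illustration and are not part of Markdown or of the re module.

```python
import re

# Same range as above: U+4E00..U+9FFF, i.e. 19968..40959 via ord().
# CJK_RUN is a hypothetical name chosen for this sketch.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')

def wrap_cjk(text):
    """Wrap each run of characters in the range in a span tag.

    wrap_cjk is a hypothetical helper, not an existing API.
    \\g<0> in the replacement stands for the whole matched run.
    """
    return CJK_RUN.sub(r'<span class="char">\g<0></span>', text)

sample = 'My name is 马克. I am 美国人.'
print(wrap_cjk(sample))
```

Whether this plugs directly into a Markdown inline pattern depends on how that hook expects its regular expression, which is not covered here.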

--Mark



