Using re to find unicode ranges

Eric Abrahamsen girzel at gmail.com
Mon Sep 29 23:46:17 EDT 2008


On Sep 29, 11:03 pm, "Mark Tolonen" <M8R-yft... at mailinator.com> wrote:
> "Eric Abrahamsen" <e... at ericabrahamsen.net> wrote in message
>
> news:mailman.1674.1222694261.3487.python-list at python.org...
>
> > Is it possible to use the re module to find runs of characters within a
> > certain Unicode range?
>
> > I'm writing a Markdown extension to go over text and wrap blocks of
> > consecutive Chinese characters in <span class="char"></span> tags for
> > nice styling in an HTML page. The available hooks appear to be a
> > pre-processor (which is a "for line in lines" situation) or an inline
> > pattern (which uses regular expressions). The regular expression solution
> > would be much simpler and faster, but something tells me there's no way
> > to use a regex to find character ranges... Chinese characters appear to
> > fall between 19968 and 40959 using ord(), and I suppose I can go that
> > route if necessary, but I think it would be ugly.
>
> # coding: utf-8
> import re
> sample = u'My name is 马克. I am 美国人.'
> for n in re.findall(ur'[\u4e00-\u9fff]+',sample):
>     print n

Of course! And obvious, once you point it out. Thanks for the help.
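
For anyone reading this later: the quoted snippet is Python 2 syntax (ur'' literals, print statement). A rough Python 3 sketch of the same character-class idea, extended to the span-wrapping described in the original question, might look like the following. The names CJK_RUN and wrap_chinese are made up for illustration, and the range covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF, i.e. 19968 to 40959 via ord()), not the extension blocks.

```python
import re

# U+4E00..U+9FFF is the basic CJK Unified Ideographs block,
# the same 19968..40959 range mentioned in the question.
# Assumption: extension blocks and CJK punctuation are out of scope.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')


def wrap_chinese(text):
    """Wrap each run of consecutive CJK ideographs in a span.

    Hypothetical helper, not part of Markdown or the re module.
    """
    return CJK_RUN.sub(
        lambda m: '<span class="char">' + m.group(0) + '</span>',
        text,
    )


sample = 'My name is 马克. I am 美国人.'

# List the runs found, then show the wrapped text.
print(CJK_RUN.findall(sample))
print(wrap_chinese(sample))
```

The same pattern string should be usable as the regex for a Markdown inline pattern, though how it gets registered depends on the Markdown version in use.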



> This sounds similar to what zhpy
> (http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to extract
> Chinese words from code, to generate executable English Python.  You might
> give that a look.
> --Mark

Mark - not quite what I'm after here, but pretty interesting
nonetheless...

E


