Using re to find unicode ranges

Paul McGuire ptmcg at austin.rr.com
Mon Sep 29 10:45:47 EDT 2008


On Sep 29, 8:17 am, Eric Abrahamsen <e... at ericabrahamsen.net> wrote:
> Is it possible to use the re module to find runs of characters within  
> a certain Unicode range?
>
> I'm writing a Markdown extension to go over text and wrap blocks of  
> consecutive Chinese characters in <span class="char"></span> tags for  
> nice styling in an HTML page. The available hooks appear to be a
> pre-processor (which is a "for line in lines" situation) or an inline
> pattern (which uses regular expressions). The regular expression  
> solution would be much simpler and faster, but something tells me  
> there's no way to use a regex to find character ranges... Chinese  
> characters appear to fall between 19968 and 40959 using ord(), and I  
> suppose I can go that route if necessary, but I think it would be ugly.
>
> Any hints or suggestions would be appreciated!
>
> Eric

Eric -

This sounds similar to what zhpy
(http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy)
does to extract Chinese words from code, to generate executable
English Python.  You might give that a look.
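
As for the regex question itself: the re module does accept character
ranges inside a character class, so the inline-pattern route may well
work directly. Here is a minimal sketch, assuming Python 3 string
literals and assuming "Chinese characters" means exactly the range you
quote (19968 to 40959 from ord(), which is U+4E00 to U+9FFF). The names
CJK_RUN and wrap_chinese are made up for illustration and are not part
of Markdown or zhpy.

```python
import re

# Assumption: the range is the one quoted in the question,
# ord() 19968..40959, i.e. U+4E00..U+9FFF.
CJK_RUN = re.compile("[\u4e00-\u9fff]+")


def wrap_chinese(text):
    """Wrap each run of consecutive characters in the range in a span."""
    return CJK_RUN.sub(
        lambda m: '<span class="char">%s</span>' % m.group(0),
        text,
    )
```

The pattern on its own ("[\u4e00-\u9fff]+") is probably the part an
inline pattern hook would need; the wrapper function is only there to
show the substitution. Characters outside that range, such as CJK
punctuation or the extension blocks, would not be matched by this
sketch.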

-- Paul


