Using re to find unicode ranges

Eric Abrahamsen eric at ericabrahamsen.net
Mon Sep 29 09:17:32 EDT 2008


Is it possible to use the re module to find runs of characters within  
a certain Unicode range?

I'm writing a Markdown extension to go over text and wrap blocks of  
consecutive Chinese characters in <span class="char"></span> tags for  
nice styling in an HTML page. The available hooks appear to be a
preprocessor (which is a "for line in lines" situation) or an inline
pattern (which uses regular expressions). The regular expression  
solution would be much simpler and faster, but something tells me  
there's no way to use a regex to find character ranges... Chinese
characters appear to fall between 19968 and 40959 (U+4E00 to U+9FFF)
using ord(), and I suppose I can go that route if necessary, but I
think it would be ugly.
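
For what it's worth, a character class in re can take a range of code
points directly, so the regex route looks feasible. Below is a minimal
sketch, assuming Python 3 (where str is Unicode and "\u4e00" escapes
work in ordinary string literals) and assuming the 19968-40959 range
above is the one wanted. The names CJK_RUN and wrap_chinese are made
up for illustration, and the Markdown inline-pattern hook itself is
not shown.

```python
import re

# Assumed range: U+4E00..U+9FFF, i.e. 19968..40959 via ord().
# The "+" makes each match a whole run of consecutive characters.
CJK_RUN = re.compile("[\u4e00-\u9fff]+")


def wrap_chinese(text):
    """Wrap each run of characters in the assumed range in a span tag."""
    return CJK_RUN.sub(
        lambda m: '<span class="char">%s</span>' % m.group(0),
        text,
    )
```

Calling wrap_chinese() on a line of mixed text should leave the
non-Chinese parts untouched and wrap each Chinese run once. Characters
outside this one block (for example CJK Extension A, U+3400 to U+4DBF)
would need additional ranges in the character class.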

Any hints or suggestions would be appreciated!

Eric


