[Baypiggies] quick question: regex to stop naughty control characters

Daniel Yoo dyoo at cs.wpi.edu
Thu Apr 26 21:47:26 CEST 2007


> I want to restrict the user to "reasonable things that a name might 
> contain".  Clearly, a tab is invalid.  However, I don't know the regex / 
> unicode syntax to express that I want "normal characters".

Hi JJ,

Ok, knowing this context helps a lot.

According to:

     http://docs.python.org/lib/re-syntax.html

the definition of "\w" can incorporate unicode-ness if we set the UNICODE 
flag.  For example, let's say I have a unicode string 'text' such that:

##################################################################
>>> text.encode('utf-8')
'\xed\x95\x98\xeb\xa3\xa8\xeb\x8f\x99\xec\x95\x88 IDLE\xea\xb0\x80\xec\xa7\x80\xea\xb3\xa0 \xeb\x86\x80\xea\xb8\xb0'
##################################################################


If I'm looking for all the words in the unicode string 'text', the 
following won't work very well:

############################
>>> import re
>>> re.findall('\\w+', text)
[u'IDLE']
############################


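The regular expression pattern there wasn't unicode-aware: without the 
UNICODE flag, "\w" only matches ASCII letters, digits, and the underscore, 
so only 'IDLE' is found.  However, this one will work: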

#########################################################################
>>> re.findall('(?u)\\w+', text)
[u'\ud558\ub8e8\ub3d9\uc548', u'IDLE\uac00\uc9c0\uace0', u'\ub180\uae30']
#########################################################################

and now we can catch those three words as expected.  (Just for reference, 
that unicoded string was the header at the top of 
http://hkn.eecs.berkeley.edu/~dyoo/python/idle_intro/IDLE_korean.html.)
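
If you'd like to reproduce this without the original page, here is a 
minimal sketch.  It assumes an interpreter that accepts both b'' and u'' 
literals (Python 2.6+ or 3.3+), and the name 'raw' is mine, just for 
illustration.  It rebuilds 'text' from the UTF-8 bytes shown earlier and 
should find the same three words:

```python
import re

# The UTF-8 bytes from the session above, decoded back into a unicode string.
raw = (b'\xed\x95\x98\xeb\xa3\xa8\xeb\x8f\x99\xec\x95\x88 '
       b'IDLE\xea\xb0\x80\xec\xa7\x80\xea\xb3\xa0 \xeb\x86\x80\xea\xb8\xb0')
text = raw.decode('utf-8')

# (?u) switches on the UNICODE flag from inside the pattern itself.
words = re.findall(u'(?u)\\w+', text)
print(len(words))
```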


Similarly, '\W' will catch non-word characters, so as long as you set the 
unicode flag, either with "(?u)" in the pattern or by passing the explicit 
re.UNICODE flag to re.compile(), you should be fine.
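
To tie this back to the name check, here is one possible sketch using 
re.compile() with re.UNICODE.  The function name, the constant name, and 
the choice of extra allowed characters (space, apostrophe, period, hyphen) 
are my own assumptions, not anything from your code, so adjust them to 
whatever "reasonable" means for your application:

```python
import re

# Hypothetical policy: word characters plus space, apostrophe, period, hyphen.
# Anything else (tab, newline, other control characters, ...) is rejected.
BAD_CHAR_RE = re.compile(u"[^\\w '.\\-]", re.UNICODE)

def is_reasonable_name(name):
    """Return True if name is non-empty and has no disallowed characters."""
    return bool(name) and BAD_CHAR_RE.search(name) is None

print(is_reasonable_name(u'JJ'))
print(is_reasonable_name(u'bad\tname'))
```

Keep in mind that "\w" also accepts digits and the underscore, so a 
stricter policy would need a narrower character class.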


Good luck!

