[Baypiggies] quick question: regex to stop naughty control characters
Daniel Yoo
dyoo at cs.wpi.edu
Thu Apr 26 21:47:26 CEST 2007
> I want to restrict the user to "reasonable things that a name might
> contain". Clearly, a tab is invalid. However, I don't know the regex /
> unicode syntax to express that I want "normal characters".
Hi JJ,
Ok, knowing this context helps a lot.
According to:
http://docs.python.org/lib/re-syntax.html
"\w" also matches Unicode letters and digits if we set the UNICODE flag.
For example, suppose I have a unicode string 'text' such that:
##################################################################
>>> text.encode('utf-8')
'\xed\x95\x98\xeb\xa3\xa8\xeb\x8f\x99\xec\x95\x88 IDLE\xea\xb0\x80\xec\xa7\x80\xea\xb3\xa0 \xeb\x86\x80\xea\xb8\xb0'
##################################################################
If I'm looking for all the words in the unicode string 'text', the
following won't work very well:
############################
>>> import re
>>> re.findall('\\w+', text)
[u'IDLE']
############################
The regular expression pattern there wasn't Unicode-aware, so it only
matched the ASCII word. However, this one will work:
#########################################################################
>>> re.findall('(?u)\\w+', text)
[u'\ud558\ub8e8\ub3d9\uc548', u'IDLE\uac00\uc9c0\uace0', u'\ub180\uae30']
#########################################################################
and now we can catch those three words as expected. (Just for reference,
that unicoded string was the header at the top of
http://hkn.eecs.berkeley.edu/~dyoo/python/idle_intro/IDLE_korean.html.)
Similarly, '\W' will catch non-word characters (a tab, for instance). As
long as you set the unicode flag, either with "(?u)" in the pattern or by
passing the explicit re.UNICODE flag to re.compile(), you should be fine.
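Here is a rough sketch of how that might look for the name check. The
helper name find_invalid_chars and the choice to also allow spaces,
hyphens, apostrophes and periods are my own assumptions, not anything
from the docs, so adjust the character class to taste. Keep in mind that
"\w" also accepts digits and the underscore. (On Python 3, str patterns
are already Unicode-aware and re.UNICODE is redundant but harmless.)

```python
import re

# Assumed rule (hypothetical): a name may contain word characters plus
# space, apostrophe, period and hyphen. Anything else is flagged.
# Note that \w also matches digits and the underscore.
INVALID_NAME_CHARS = re.compile(u"[^\\w '.-]", re.UNICODE)


def find_invalid_chars(name):
    """Return a list of the characters in 'name' that the rule rejects."""
    return INVALID_NAME_CHARS.findall(name)


# A tab is rejected; Korean text and ordinary punctuation pass.
print(find_invalid_chars(u"JJ\tBehrens"))
print(find_invalid_chars(u"\ud558\ub8e8\ub3d9\uc548 IDLE"))
print(find_invalid_chars(u"Mary-Jane O'Neil Jr."))
```

An empty list means the name passed. Returning the offending characters
rather than a plain True/False makes it easier to tell the user what to
fix.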
Good luck!