[Baypiggies] quick question: regex to stop naughty control characters

Wed Apr 25 23:38:34 CEST 2007

Kelly Yancey wrote:
> Shannon -jj Behrens wrote:
>> Hi,
>>
>> I'm doing some form validation.  I accept UTF-8 strings and decode
>> them to unicode objects.  I would like to check that the strings are
>> no longer than 128 characters, and that they are "reasonable".  I'm
>> using FormEncode with a regex that looks like r".{1,128}$".  By
>> "reasonable", I think the only thing I want to prevent is control
>> characters.  Now, I'm sure some Unicode whiz out there knows how to do
>> this with some funky Unicode regex magic, but I don't know how.
>> Anyone know the right way to do this?  Should I be worried about more
>> than just control characters?  I'm already taking care of HTML
>> escaping, SQL injection, etc.
>>
>> Thanks,
>> -jj
>>
> 
>   JJ,
> 
>   It ain't pretty, but how about this:
> 
>     ur"(?u)^[\u0000-\u001f\u007f-\u009f]{1,128}$"
> 

   Oops, I forgot to negate the character class after pasting it into the email:

      ur"(?u)^[^\u0000-\u001f\u007f-\u009f]{1,128}$"

   Kelly
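
   For anyone trying the corrected pattern on Python 3, here is a minimal
   sketch of the same check. It is an adaptation under assumptions, not what
   was posted: the ur"" prefix is not valid in Python 3, so a raw string is
   used instead, and the name is_reasonable is made up for illustration. It
   also uses fullmatch() rather than ^...$, because "$" matches just before a
   trailing newline, so the pattern as written would accept "abc\n".

```python
import re

# Same character class as the corrected pattern: anything except the
# C0 controls (U+0000-U+001F), DEL (U+007F) and the C1 controls
# (U+0080-U+009F), repeated 1 to 128 times.
NO_CONTROL_RE = re.compile(r"[^\u0000-\u001f\u007f-\u009f]{1,128}")


def is_reasonable(text):
    """Return True if text is 1 to 128 characters with no control characters.

    Hypothetical helper name; fullmatch() requires the whole string to
    match, so a trailing newline is rejected as well.
    """
    return NO_CONTROL_RE.fullmatch(text) is not None
```

   Note that this rejects tabs and newlines too, which is probably what you
   want for a single-line form field but would need loosening for a textarea.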

>   If Python's re module implemented POSIX named character classes, you 
> could do this:
>     r"(?u)^[^[:cntrl:]]{1,128}$"
> 
> Or if it supported Unicode regular expressions as detailed in 
> http://www.unicode.org/unicode/reports/tr18/, you could do this:
>     r"(?u)^\P{Control}{1,128}$"
> 
> But alas, we aren't there yet. :(
> https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1528154&group_id=5470 
> 
> 
>   I hope that works for you,
> 
>   Kelly
> 
> 
> 
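
Until the re module grows something like \P{Control}, one way to get roughly
the same effect without a regex is to ask the standard unicodedata module for
each character's general category. The sketch below is only an illustration:
the function names and the max_len parameter are invented here, and it treats
"control character" as general category Cc, which covers the same code points
as the explicit ranges in the regex above.

```python
import unicodedata


def has_control_chars(text):
    """Return True if any character has Unicode general category Cc."""
    return any(unicodedata.category(ch) == "Cc" for ch in text)


def is_reasonable_by_category(text, max_len=128):
    """Hypothetical validator: 1..max_len characters, none of them controls."""
    return 1 <= len(text) <= max_len and not has_control_chars(text)
```

The same test could be widened to other categories, such as Cf (format
characters), if it turns out that more than control characters need to be
kept out.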

-- 
Kelly Yancey
http://kbyanc.blogspot.com/