[Baypiggies] quick question: regex to stop naughty control characters

Tung Wai Yip tungwaiyip at yahoo.com
Thu Apr 26 21:06:09 CEST 2007


> The decoding happens first.  Then I check for *character* length and
> "reasonableness".
>
>> > By "reasonable", I think the only thing I want to prevent are control
>> > characters.
>>
>> What do you mean by a "control character"?  Can you be more specific
>> about the context that you're trying to guard?
>
> Back space characters, newlines, etc.

You can pretty much just check for ord(c) < 32 to catch control characters.
After that, watch out for things like < or " for code injection. Anything
outside of the ASCII range is pretty much safe. I'm not aware of any usage
of those as control characters.
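
As a rough sketch of that check (Python 3 syntax; the function names here
are made up for illustration, not from any library), something like the
following should work. The regex version matches the subject line. The
last function is a stricter variant I'm adding as an assumption: Unicode
also puts DEL (127) and U+0080 to U+009F in its control category "Cc", so
you may or may not want to reject those too. html.escape is one stdlib
way to neutralize < and " when the text ends up in HTML.

```python
import html
import re
import unicodedata

# Matches any character with a code point below 32 (hypothetical name).
CONTROL_RE = re.compile(r"[\x00-\x1f]")


def has_control_chars(text):
    """Return True if any character satisfies ord(c) < 32."""
    return any(ord(c) < 32 for c in text)


def has_control_chars_re(text):
    """Same check as above, expressed as a regex search."""
    return CONTROL_RE.search(text) is not None


def has_unicode_control_chars(text):
    """Stricter variant (an assumption, not from the thread): flags every
    character in Unicode category 'Cc', which includes DEL and U+0080-U+009F."""
    return any(unicodedata.category(c) == "Cc" for c in text)


def escape_for_html(text):
    """Escape <, >, & and quotes so the text is inert inside HTML."""
    return html.escape(text)
```

Whether newlines and tabs should count as "naughty" depends on the field;
for a single-line input the plain ord(c) < 32 test is probably enough.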

The only other caveat I can think of is that .decode('UTF-8') may fail,
since not all byte sequences are valid UTF-8. Bad input like that might
come from a faulty web crawler or from malicious code.
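
A minimal way to guard against that, again as a sketch in Python 3 syntax
with a made-up helper name, is to catch UnicodeDecodeError and treat the
input as rejected:

```python
def decode_utf8_or_none(raw):
    """Return the decoded text, or None if raw is not valid UTF-8.

    decode_utf8_or_none is a hypothetical helper name for illustration.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Invalid byte sequence: let the caller reject the input.
        return None
```

Returning None is just one choice; you could equally re-raise or log the
offending bytes, depending on how the request is handled.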

Wai Yip


