[Baypiggies] quick question: regex to stop naughty control characters
Tung Wai Yip
tungwaiyip at yahoo.com
Thu Apr 26 21:06:09 CEST 2007
> The decoding happens first. Then I check for *character* length and
> "reasonableness".
>
>> > By "reasonable", I think the only thing I want to prevent are control
>> > characters.
>>
>> What do you mean by a "control character"? Can you be more specific
>> about
>> the context that you're trying to guard?
>
> Back space characters, newlines, etc.
You can pretty much just check for ord(c) < 32 to catch control characters
(add 127, DEL, if you want to be thorough). After that, watch out for
things like < or " for code injection. Characters outside the ASCII range
are pretty much safe. The one exception is the C1 range, U+0080 to U+009F,
which Unicode also classes as control characters.
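
Here is a rough sketch of that check, assuming Python 3 and input that has
already been decoded to str. The names is_control, has_control_chars and
escape_markup are made up for illustration. The DEL and C1 tests go a bit
beyond the plain ord(c) < 32 check, so drop them if you don't want them.

```python
import html


def is_control(c):
    """Return True if the single character c looks like a control character."""
    n = ord(c)
    # n < 32 is the basic check (backspace, newline, tab, ...).
    # 0x7F (DEL) and the C1 range 0x80-0x9F are extras added by this sketch.
    return n < 32 or 0x7F <= n <= 0x9F


def has_control_chars(text):
    """Return True if any character in text is a control character."""
    return any(is_control(c) for c in text)


def escape_markup(text):
    """Escape &, <, > and quotes so the text is safe to embed in HTML."""
    return html.escape(text, quote=True)
```

Something like has_control_chars("name\x08") would flag the backspace, while
ordinary accented text passes. Whether escaping with html.escape is enough
depends on where the string ends up, so treat that part as a starting point
only.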
The only other caveat I can think of is that .decode('UTF-8') may fail
with a UnicodeDecodeError, since not all byte sequences are valid UTF-8.
Bad input like that may come from a faulty web crawler or malicious code.
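
One way to guard the decode step, again only a sketch assuming Python 3
bytes input. The helper name safe_decode is hypothetical, and returning
None on failure is just one possible policy (you could also reject the
request outright).

```python
def safe_decode(raw):
    """Return raw decoded as UTF-8 text, or None if it is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not every byte sequence is valid UTF-8; treat this input as bad.
        return None
```

With this, safe_decode(b"\xff\xfe") gives None instead of raising, so the
length and control character checks only ever see properly decoded text.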
Wai Yip