[Baypiggies] quick question: regex to stop naughty control characters
Shannon -jj Behrens
jjinux at gmail.com
Thu Apr 26 20:45:13 CEST 2007
On 4/25/07, Daniel Yoo <dyoo at cs.wpi.edu> wrote:
> Hi JJ,
>
> The question is slightly underdefined still; do you mind if I ask a few
> more questions?
>
>
> > I accept UTF-8 strings and decode them to unicode objects.
>
> Ok, so what we really have are bytes whose intended interpretation is
> utf-8, yes?
> Is the input a unicode string? Or is it rather a sequence of
> bytes (which Python often uses a regular string for)?
I have Python unicode objects, i.e. the thing you get back when you do
"whatever".decode('utf-8')
> > I would like to check that the strings are no longer than 128 characters
>
> Unfortunately, "characters" is ambiguous and has at least two meanings
> these days. Do you mean 128 bytes, or 128 unicode characters? There's a
> slight ambiguity here that needs to be cleared up before this problem can
> be attacked.
Naturally ;) Furthermore, Unicode complicates this mess even more by
permitting some characters to be represented in multiple ways.
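For what it's worth, here is a minimal sketch of that ambiguity, assuming a modern Python where str is already unicode. The example strings are just illustrations of mine; unicodedata.normalize is the standard library call for picking one canonical form before counting.

```python
import unicodedata

composed = "\u00e9"       # e-acute as a single code point
decomposed = "e\u0301"    # "e" followed by a combining acute accent

# Same visible character, different lengths, and they do not compare equal.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalizing to NFC folds the two-code-point spelling into one code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed, len(nfc))
```

So a length check in "characters" probably wants to normalize first, or at least be explicit that it is counting code points.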
> Also, what part of this really requires regular expressions here? What
> you've shown so far restricts a string by length, but that's already a
> simpler conditional:
>
> len(some_string) < 128
>
> I have to assume it has something to do with the definition of
> reasonableness.
Exactly. I want it to be 128 *characters* or fewer, and I want all of
those characters to be *reasonable* for some unclear definition of
reasonableness. I know that control characters are clearly
unreasonable. I'm not sure if I should restrict anything else.
> Does the check for reasonableness have to happen at the same time as the
> test for length?
Not necessarily.
> Must the check for reasonableness happen before decoding
> bytes assuming a utf-8 interpretation? Or can something like:
>
> return (len(some_string) < 128 and
> is_reasonable(decode(some_string, 'utf-8')))
>
> suffice?
The decoding happens first. Then I check for *character* length and
"reasonableness".
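Roughly, the order of operations looks like the sketch below. The names validate, MAX_CHARS, and the is_reasonable callback are hypothetical, and I'm assuming the input arrives as raw bytes and that "length" means code points after decoding.

```python
MAX_CHARS = 128  # limit from this thread, counted in code points, not bytes


def validate(raw, is_reasonable):
    """Decode UTF-8 bytes first, then check length and reasonableness.

    Returns the decoded text, or None if any check fails.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not valid UTF-8 at all
    if len(text) > MAX_CHARS:
        return None  # too many characters
    if not is_reasonable(text):
        return None  # failed the (still undefined) reasonableness test
    return text


# Placeholder predicate: reject only newlines and backspaces.
def no_newline_or_backspace(s):
    return "\n" not in s and "\b" not in s


print(validate("caf\u00e9".encode("utf-8"), no_newline_or_backspace))
```

Note that 128 two-byte characters pass even though they take 256 bytes, which is the distinction Daniel was asking about.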
> > By "reasonable", I think the only thing I want to prevent are control
> > characters.
>
> What do you mean by a "control character"? Can you be more specific about
> the context that you're trying to guard?
Backspace characters, newlines, etc.
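One possible reading of "control character", as a sketch only: the C0 range, DEL, and the C1 range, which is what Unicode puts in category Cc. The names CONTROL_CHARS and has_control_chars are mine, and whether tab should be allowed is an open question this does not settle (it rejects tab).

```python
import re

# C0 controls (U+0000-U+001F), DEL (U+007F), and C1 controls (U+0080-U+009F).
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def has_control_chars(text):
    """Return True if the decoded text contains any control character."""
    return CONTROL_CHARS.search(text) is not None


print(has_control_chars("plain text"))
print(has_control_chars("back\bspace"))
print(has_control_chars("two\nlines"))
```

This only covers control characters; it says nothing about other things that might be "unreasonable", such as unassigned code points or bidirectional override characters.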
> I apologize about being pedantic, but form validation needs to be handled
> methodically to be valuable.
Agreed. That's why I'm asking ;)
Thanks Daniel!
-jj
--
http://jjinux.blogspot.com/