[Baypiggies] quick question: regex to stop naughty control characters

Thu Apr 26 20:45:13 CEST 2007

On 4/25/07, Daniel Yoo <dyoo at cs.wpi.edu> wrote:
> Hi JJ,
>
> The question is slightly underdefined still; do you mind if I ask a few
> more questions?
>
>
> > I accept UTF-8 strings and decode them to unicode objects.
>
> Ok, so what we really have are bytes whose intended interpretation is
> utf-8, yes?
> Is the input a unicode string?  Or is it rather a sequence of
> bytes (which Python often uses a regular string for)?

I have Python unicode objects, i.e. the thing you get back when you do
"whatever".decode('utf-8')

> > I would like to check that the strings are no longer than 128 characters
>
> Unfortunately, "characters" is ambiguous and has at least two meanings
> these days.  Do you mean 128 bytes, or 128 unicode characters?  There's a
> slight ambiguity here that needs to be cleared up before this problem can
> be attacked.

Naturally ;)  Furthermore, Unicode complicates this mess even more by
permitting some characters to be represented in multiple ways.
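
To make the "multiple ways" point concrete, here is a small sketch. It assumes Python 3, where str is already a unicode string, and it assumes NFC is the normalization form you want; neither choice comes from this thread.

```python
import unicodedata

composed = "\u00e9"      # 'e with acute' as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Both render the same way, but their lengths differ.
print(len(composed), len(decomposed))  # 1 2

# Normalizing to NFC (an assumed choice) makes the two forms compare equal.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)  # True
```

If the length limit is meant to match what a user sees, normalizing before counting is probably worth considering.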

> Also, what part of this really requires regular expressions here?  What
> you've shown so far restricts a string by length, but that's already a
> simpler conditional:
>
>      len(some_string) < 128
>
> I have to assume it has something to do with the definition of
> reasonableness.

Exactly.  I want it to be 128 *characters* or fewer, and I want all of
those characters to be *reasonable* for some unclear definition of
reasonableness.  I know that control characters are clearly
unreasonable.  I'm not sure if I should restrict anything else.
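
One possible reading of that, as a rough sketch: treat "control character" as Unicode general category Cc, and count length in code points. The name is_reasonable is borrowed from Daniel's snippet; MAX_CHARS and the Cc rule are my assumptions, and the code assumes Python 3 strings.

```python
import unicodedata

MAX_CHARS = 128  # assumed constant; the limit itself is from this thread


def is_reasonable(text, max_chars=MAX_CHARS):
    """Return True if text is short enough and has no control characters.

    "Control character" is taken to mean Unicode category Cc
    (U+0000-U+001F and U+007F-U+009F). That is an assumption.
    """
    if len(text) > max_chars:
        return False
    return not any(unicodedata.category(ch) == "Cc" for ch in text)
```

Note that Cc includes tab, newline and backspace. Other categories (format characters, unassigned code points, and so on) are not rejected by this sketch.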

> Does the check for reasonableness have to happen at the same time as the
> test for length?

Not necessarily.

> Must the check for reasonableness happen before decoding
> bytes assuming a utf-8 interpretation?  Or can something like:
>
>      return (len(some_string) < 128 and
>              is_reasonable(decode(some_string, 'utf-8')))
>
> suffice?

The decoding happens first.  Then I check for *character* length and
"reasonableness".
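
A quick sketch of why the order matters, assuming Python 3 bytes and str. The helper name decode_utf8 is made up for illustration.

```python
def decode_utf8(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# 100 accented characters take 200 bytes in UTF-8, so a byte-length
# check before decoding would reject a string that is within the limit.
raw = ("\u00e9" * 100).encode("utf-8")
text = decode_utf8(raw)
print(len(raw), len(text))  # 200 100
```

Decoding first also means malformed input is caught before any other check runs.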

> > By "reasonable", I think the only thing I want to prevent are control
> > characters.
>
> What do you mean by a "control character"?  Can you be more specific about
> the context that you're trying to guard?

Backspace characters, newlines, etc.
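
Since the subject line asks for a regex, here is one way it might look. This is only a sketch: it assumes Python 3, it assumes "control character" means the C0 and C1 ranges plus DEL, and the name REASONABLE_RE is invented here.

```python
import re

# Up to 128 characters, none of them in U+0000-U+001F or U+007F-U+009F.
# The ranges and the pattern name are assumptions, not a settled definition.
REASONABLE_RE = re.compile(r"[^\x00-\x1f\x7f-\x9f]{0,128}")


def matches_reasonable(text):
    # fullmatch avoids the case where "$" would accept a trailing newline.
    return REASONABLE_RE.fullmatch(text) is not None
```

The pattern combines the length limit and the character restriction in one step, which is handy if a form library only accepts a regex.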

> I apologize about being pedantic, but form validation needs to be handled
> methodically to be valuable.

Agreed.  That's why I'm asking ;)

Thanks Daniel!
-jj

-- 
http://jjinux.blogspot.com/