Ann: Validating Emails and HTTP URLs in Python

Tue May 4 08:02:51 EDT 2010

First, it's good to see a library that has URL and email validators.

But I think there may be a problem in your validator. The URLs in
question are:

  http://example.com/path
  http://example.com/path)
  http://example.com/path]
  http://example.com/path}

By my understanding of the RFCs, only the first two are valid.

  >>> from lepl.apps.rfc3696 import *
  >>> v = HttpUrl()
  >>> v('http://example.com/')
  True
  >>> v('http://example.com/path')
  True
  >>> v('http://example.com/path)')
  True
  >>> v('http://example.com/path]')
  True
  >>> v('http://example.com/path}')
  True

You used RFC 3696 [1] to write your code (I read your source code,
lepl.apps.rfc3696._HttpUrl()). I think your code should return True
only for the first case, but it returns True for all of them. Maybe I
am using it incorrectly?

I also think there is a slight issue here, because RFC 3696 was written
based on RFC 2396 [2], which is obsoleted by RFC 3986 [3]. I have never
really read RFC 3696 in full, so I am not sure whether this causes a
problem.

But RFC 3696 says:

   The following characters are reserved in many URIs -- they must be
   used for either their URI-intended purpose or must be encoded.  Some
   particular schemes may either broaden or relax these restrictions
   (see the following sections for URLs applicable to "web pages" and
   electronic mail), or apply them only to particular URI component
   parts.

      ; / ? : @ & = + $ , ?

However, RFC 2396 (the obsoleted RFC), section "3.3. Path Component",
says:

   The path component contains data, specific to the authority (or the
   scheme if there is no authority component), identifying the resource
   within the scope of that scheme and authority.

      path          = [ abs_path | opaque_part ]

      path_segments = segment *( "/" segment )
      segment       = *pchar *( ";" param )
      param         = *pchar

      pchar         = unreserved | escaped |
                      ":" | "@" | "&" | "=" | "+" | "$" | ","

Here is the definition of unreserved, which pchar refers to:

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" |
                    "(" | ")"

In RFC 3986 the definitions are a bit different, but my point here is
that "(" and ")" are allowed in a path, while "]" and "}" are not.

The Uri module from 4Suite returns the results I expect:

  >>> import Ft.Lib.Uri as U
  >>> U.MatchesUriSyntax('http://example.com/path')
  True
  >>> U.MatchesUriSyntax('http://example.com/path)')
  True
  >>> U.MatchesUriSyntax('http://example.com/path}')
  False
  >>> U.MatchesUriSyntax('http://example.com/path]')
  False
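
For anyone without 4Suite installed, the RFC 2396 path grammar quoted
above can be approximated in a few lines. This is only my own rough
sketch, not code from LEPL or 4Suite; the names ABS_PATH and
path_is_valid_2396 are made up for illustration, and it looks at the
path component only.

```python
import re
from urllib.parse import urlsplit

# RFC 2396: mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
MARK = r"\-_.!~*'()"
# unreserved = alphanum | mark
UNRESERVED = r"A-Za-z0-9" + MARK
# escaped = "%" hex hex
ESCAPED = r"%[0-9A-Fa-f]{2}"
# pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
PCHAR = r"(?:[" + UNRESERVED + r":@&=+$,]|" + ESCAPED + r")"
# segment = *pchar *( ";" param ), param = *pchar
SEGMENT = PCHAR + r"*(?:;" + PCHAR + r"*)*"
# abs_path = "/" path_segments
ABS_PATH = re.compile(r"/" + SEGMENT + r"(?:/" + SEGMENT + r")*")


def path_is_valid_2396(url):
    """Return True if the path of url fits the RFC 2396 abs_path rule."""
    path = urlsplit(url).path
    # An empty path is allowed; otherwise the whole path must match.
    return path == "" or ABS_PATH.fullmatch(path) is not None


for url in ("http://example.com/path", "http://example.com/path)",
            "http://example.com/path]", "http://example.com/path}"):
    print(url, path_is_valid_2396(url))
```

With this sketch, the first two URLs pass and the last two are
rejected, which matches the 4Suite results.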

I think you should use (and read) RFC 3986, not RFC 3696, for URL
validation.
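
As a rough illustration of what RFC 3986 allows in a path, here is a
sketch that lists the characters in a URL's path that fall outside its
pchar rule (unreserved / pct-encoded / sub-delims / ":" / "@"). The
names are hypothetical, and "%" is accepted without checking the two
hex digits that must follow it, so this is not a full validator.

```python
import string
from urllib.parse import urlsplit

# RFC 3986: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
SUB_DELIMS = set("!$&'()*+,;=")
# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# "/" separates segments; "%" is a loose stand-in for pct-encoded.
PATH_CHARS = UNRESERVED | SUB_DELIMS | set(":@/%")


def bad_path_chars(url):
    """Return the sorted path characters that RFC 3986 pchar does not allow."""
    return sorted(set(urlsplit(url).path) - PATH_CHARS)


for url in ("http://example.com/path)", "http://example.com/path]",
            "http://example.com/path}"):
    print(url, bad_path_chars(url))
```

An empty list means every character in the path is permitted, so
"/path)" comes back clean while "/path]" and "/path}" report the
offending bracket.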

One more thing: the HttpUrl() docstring needs s/email/url/.

[1]: http://tools.ietf.org/html/rfc3696
[2]: http://tools.ietf.org/html/rfc2396
[3]: http://tools.ietf.org/html/rfc3986