urllib interpretation of URL with ".."

Mon Jun 25 04:46:22 EDT 2007

"Martin v. Löwis" <martin at v.loewis.de> wrote:

>> Is "urllib" wrong?
> 
> I can't see how. HTTP 1.1 says that the parameter to the GET
> request should be an abs_path; RFC 2396 says that
> /../acatalog/shop.html is indeed an abs_path, as .. is a valid
> segment. That RFC also has a section on relative identifiers
> and normalization; it defines what .. means *in a relative path*.
> 
> Section 4 is explicit about .. in absolute URIs:
> # The syntax for relative URI is a shortened form of that for absolute
> # URI, where some prefix of the URI is missing and certain path
> # components ("." and "..") have a special meaning when, and only when,
> # interpreting a relative path.
> 
> Notice the "and only when": the browsers who modify above
> URL before sending it seem to be in clear violation of
> RFC 2396.

Section 5.2 of RFC 2396 is also relevant here. In particular, step 6(g):

>       g) If the resulting buffer string still begins with one or more
>          complete path segments of "..", then the reference is
>          considered to be in error.  Implementations may handle this
>          error by retaining these components in the resolved path (i.e.,
>          treating them as part of the final URI), by removing them from
>          the resolved path (i.e., discarding relative levels above the
>          root), or by avoiding traversal of the reference.

The common practice seems to be for client-side implementations to take
the second of these options (removing the segments) and for servers to take
the third (avoiding traversal of the reference). urllib takes the first
(retaining the segments), which is also correct but not as useful as it
might be.
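
To make the three choices concrete, here is a rough sketch of how each one
treats the path from this thread. The function name resolve_dotdot and its
policy argument are made up for illustration; this is not urllib's code, and
it ignores details such as preserving a trailing slash after a final "." or
"..".

```python
def resolve_dotdot(path, policy="remove"):
    """Collapse "." and ".." segments in an absolute URL path.

    policy decides what happens to ".." segments that would climb
    above the root (hypothetical names, one per option in step 6(g)):
      "retain" - keep them in the resolved path
      "remove" - silently drop them
      "reject" - refuse to resolve the reference
    """
    if policy not in ("retain", "remove", "reject"):
        raise ValueError("unknown policy: %r" % (policy,))
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")

    above_root = []  # ".." segments that had nothing left to cancel
    resolved = []    # ordinary segments kept so far
    for segment in path.split("/")[1:]:
        if segment == ".":
            continue
        if segment == "..":
            if resolved:
                resolved.pop()  # normal case: cancel the previous segment
            elif policy == "reject":
                raise ValueError("path climbs above the root: %r" % (path,))
            elif policy == "retain":
                above_root.append(segment)
            # policy == "remove": drop the segment
        else:
            resolved.append(segment)
    return "/" + "/".join(above_root + resolved)


url_path = "/../acatalog/shop.html"
print(resolve_dotdot(url_path, "retain"))
print(resolve_dotdot(url_path, "remove"))
try:
    resolve_dotdot(url_path, "reject")
except ValueError as exc:
    print("rejected:", exc)
```

With "retain" the path goes out unchanged, which matches what urllib is
described as doing above; "remove" gives /acatalog/shop.html, which is what
the browsers appear to send.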