urllib interpretation of URL with ".."
Duncan Booth
duncan.booth at invalid.invalid
Mon Jun 25 04:46:22 EDT 2007
"Martin v. Löwis" <martin at v.loewis.de> wrote:
>> Is "urllib" wrong?
>
> I can't see how. HTTP 1.1 says that the parameter to the GET
> request should be an abs_path; RFC 2396 says that
> /../acatalog/shop.html is indeed an abs_path, as .. is a valid
> segment. That RFC also has a section on relative identifiers
> and normalization; it defines what .. means *in a relative path*.
>
> Section 4 is explicit about .. in absolute URIs:
> # The syntax for relative URI is a shortened form of that for absolute
> # URI, where some prefix of the URI is missing and certain path
> # components ("." and "..") have a special meaning when, and only when,
> # interpreting a relative path.
>
> Notice the "and only when": the browsers who modify above
> URL before sending it seem to be in clear violation of
> RFC 2396.
Section 5.2 is also relevant here. In particular:
> g) If the resulting buffer string still begins with one or more
> complete path segments of "..", then the reference is
> considered to be in error. Implementations may handle this
> error by retaining these components in the resolved path (i.e.,
> treating them as part of the final URI), by removing them from
> the resolved path (i.e., discarding relative levels above the
> root), or by avoiding traversal of the reference.
The common practice seems to be for client-side implementations to handle
this using option 2 (removing them) and servers to use option 3 (avoiding
traversal of the reference). urllib uses option 1 which is also correct but
not as useful as it might be.
More information about the Python-list
mailing list