[issue25403] urllib.parse.urljoin is broken in python 3.5

Martin Panter report at bugs.python.org
Thu Oct 15 22:19:30 EDT 2015


Martin Panter added the comment:

It is true that 3.5 is meant to follow RFC 3986, which obsoletes RFC 1808 and specifies slightly different behaviour for abnormal cases. This change is documented under urljoin(), and also in “What’s New in 3.5”. Pavel’s first case is one of these differences in the RFCs, and I don’t think it is a bug. According to <https://tools.ietf.org/html/rfc3986.html#section-5.2.4>,

“The remove_dot_segments algorithm respects [the base’s] hierarchy by removing extra dot-segments rather than treating them as an error or leaving them to be misinterpreted by dereference implementations.”
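
As a small illustration of that rule, the sketch below joins references that contain more ".." segments than the base path has levels. The base URL is the example base from RFC 3986 section 5.4, not one of Pavel's cases, and the outputs in the comments assume Python 3.5 or later.

```python
from urllib.parse import urljoin

# Example base URL taken from RFC 3986 section 5.4 (an assumption for
# illustration; it is not one of the cases reported in this issue).
base = "http://a/b/c/d;p?q"

# The reference climbs above the root of the base path; the extra ".."
# segments are dropped instead of being kept or treated as an error.
print(urljoin(base, "../../../g"))     # http://a/g
print(urljoin(base, "../../../../g"))  # http://a/g
```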

RFC 3986 doesn’t directly cover Pavel’s second and third cases, because the base URL is relative. The RFC only covers absolute base URLs, which start with a scheme like “http:”. The documentation doesn’t really bless these cases either; it describes urljoin() with ‘Construct a full (“absolute”) URL’. However, there is explicit support in the source code ("" in urllib.parse.uses_relative).
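
The source-code support mentioned here can be checked from the interpreter. This is only a quick sketch; it shows that the empty scheme is listed in uses_relative and that a relative base such as "a/" parses with an empty scheme.

```python
import urllib.parse

# The empty string stands for "no scheme", i.e. a relative base URL.
print("" in urllib.parse.uses_relative)       # True

# A relative base parses with an empty scheme, so urljoin() treats it
# as belonging to the schemes that support relative references.
print(urllib.parse.urlsplit("a/").scheme == "")  # True
```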

It looks like 3.5 is strict in following the RFC’s Remove Dot Segments algorithm. Step 2C says that for “/../” or “/..”, the parent segment is removed, but the input is always replaced with “/”:

“a/..” → “/”
“a/../..” → “/..” → “/”
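
To make the two traces above easier to follow, here is a minimal sketch of the Remove Dot Segments loop as I read RFC 3986 section 5.2.4. It is written for illustration and is not the code urllib.parse uses; the function name is my own.

```python
def remove_dot_segments(path):
    """Sketch of RFC 3986 section 5.2.4 (steps 2A to 2E); illustrative only."""
    inp, out = path, []
    while inp:
        # 2A: drop a leading "../" or "./"
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        # 2B: "/./" or a final "/." is replaced with "/"
        elif inp.startswith("/./"):
            inp = inp[2:]
        elif inp == "/.":
            inp = "/"
        # 2C: "/../" or a final "/.." is replaced with "/", and the last
        # output segment (with its leading "/", if any) is removed
        elif inp.startswith("/../"):
            inp = inp[3:]
            if out:
                out.pop()
        elif inp == "/..":
            inp = "/"
            if out:
                out.pop()
        # 2D: a lone "." or ".." is dropped
        elif inp in (".", ".."):
            inp = ""
        # 2E: move the first segment (with its leading "/", if any) to
        # the output, up to but not including the next "/"
        else:
            cut = inp.find("/", 1)
            if cut == -1:
                cut = len(inp)
            out.append(inp[:cut])
            inp = inp[cut:]
    return "".join(out)


print(remove_dot_segments("a/.."))     # /
print(remove_dot_segments("a/../.."))  # /
```

In both cases step 2C puts a "/" back into the input even though the removed segment "a" had no slash in front of it, which is where the leading slash in the result comes from.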

I would prefer a less strict interpretation of the spirit of the algorithm. Do not introduce a slash in the input if you did not remove one from the output buffer:

“a/..” → empty URL
“a/../..” → “..” → empty URL
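
One possible reading of this less strict rule is sketched below. It is a hypothetical variant, not a proposed patch: the function name is made up, and the only change from the RFC steps is in step 2C, where the "/" is put back into the input only if the segment removed from the output carried a leading "/". I have only checked it against the two examples above, and it would need more thought for absolute paths that begin with "/..".

```python
def lenient_remove_dot_segments(path):
    """Hypothetical variant of RFC 3986 section 5.2.4; illustrative only."""
    inp, out = path, []
    while inp:
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        elif inp.startswith("/./"):
            inp = inp[2:]
        elif inp == "/.":
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":
            # Whatever follows the "/../" (or nothing after a final "/..")
            rest = inp[4:] if inp.startswith("/../") else ""
            removed = out.pop() if out else ""
            # Modified step 2C: only reintroduce "/" if the removed
            # output segment actually had a slash in front of it.
            inp = ("/" if removed.startswith("/") else "") + rest
        elif inp in (".", ".."):
            inp = ""
        else:
            cut = inp.find("/", 1)
            if cut == -1:
                cut = len(inp)
            out.append(inp[:cut])
            inp = inp[cut:]
    return "".join(out)


print(repr(lenient_remove_dot_segments("a/..")))     # ''
print(repr(lenient_remove_dot_segments("a/../..")))  # ''
```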

Python 3.4 and earlier did not behave sensibly if you extend the relative URL:

>>> urljoin("a/", "..")
''
>>> urljoin("a/", "../..")
'..'
>>> urljoin("a/", "../../..")
''
>>> urljoin("a/", "../../../..")
'../'

Pavel, what behaviour would you expect in these cases? My empty URL interpretation, or perhaps a more sensible version of the Python 3.4 behaviour? What is your use case?

I also noticed a related and (IMO) more serious regression compared to 3.4, where part of the path becomes a host name:

>>> urljoin("file:///base", "/dummy/..//host/oops")
'file://host/oops'
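
To show why this result is a problem, the sketch below parses the joined URL back with urlsplit() and compares it with the unresolved form of the same reference. It does not call urljoin(), so it does not depend on which Python version is in use.

```python
from urllib.parse import urlsplit

# The joined result from the session above: "host" is parsed as the
# network location, not as part of the path.
joined = urlsplit("file://host/oops")
print(joined.netloc)  # host
print(joined.path)    # /oops

# Before dot-segment removal the authority is empty and everything,
# including "//host/oops", is still part of the path.
original = urlsplit("file:///dummy/..//host/oops")
print(repr(original.netloc))  # ''
print(original.path)          # /dummy/..//host/oops
```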

----------
components:  -Interpreter Core

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue25403>
_______________________________________

