urllib.parse.quote() and RFC 2396: unreserved characters get encoded

Bruno Cauet brunocauet at gmail.com
Wed Feb 11 05:03:55 EST 2015


Hi,
I believe that pathlib and urllib.parse.quote() do not correctly build URIs.
According to RFC 2396 (http://tools.ietf.org/html/rfc2396.html),
unreserved characters such as "!" should not be escaped (section 2.3):
   Unreserved characters can be escaped without changing the semantics
   of the URI, but this should not be done unless the URI is being used
   in a context that does not allow the unescaped character to appear.
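
To illustrate, here is a minimal sketch of what I see on a recent
Python 3. The names RFC2396_MARKS and quote_rfc2396 are my own and
are not part of urllib; the only real API used is the "safe"
parameter of urllib.parse.quote():

```python
from urllib.parse import quote

# "Mark" characters that RFC 2396 section 2.3 lists as unreserved,
# in addition to letters and digits.
RFC2396_MARKS = "-_.!~*'()"


def quote_rfc2396(path):
    """Hypothetical helper: quote a path but leave RFC 2396 marks alone."""
    return quote(path, safe="/" + RFC2396_MARKS)


default_result = quote("/a!b(c)")          # default safe="/" only
relaxed_result = quote_rfc2396("/a!b(c)")  # marks passed as safe

print(default_result)  # /a%21b%28c%29
print(relaxed_result)  # /a!b(c)
```

Passing the marks through "safe" is only a workaround on the caller's
side; it does not change what pathlib does internally.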

Unicode characters outside the ASCII range also get encoded when they
have no reason to, e.g.
   >>> pathlib.PurePath("/home/싸이/").as_uri()
   'file:///home/%EC%8B%B8%EC%9D%B4'
while the result should simply be 'file:///home/싸이'
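
The same thing as a small script, as a sketch only. I use
PurePosixPath here (my choice, so the example does not depend on the
platform) and urllib.parse.unquote() to show that the escaped form
decodes back to the readable one:

```python
import pathlib
from urllib.parse import unquote

# PurePosixPath keeps the example independent of the host OS.
uri = pathlib.PurePosixPath("/home/싸이/").as_uri()

# Decoding the percent-escapes gives back the readable form.
readable = unquote(uri)

print(uri)                                # file:///home/%EC%8B%B8%EC%9D%B4
print(readable == "file:///home/싸이")     # True
```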

The documentation
(https://docs.python.org/3.5/library/urllib.parse.html) mentions RFC
2396, so I think it should be respected.
Am I missing something? Is it a bug?

Source-wise, the problem stems from the use of
urllib.parse.quote_from_bytes(), which quotes everything. I'm not sure
where the responsibility lies, but I believe that
urllib.parse.quote() is affected by the same problem, even though it
mentions RFC 2396 in its docstring.

The problem this causes me is that URIs generated by Python
do not match the URIs generated by other tools, so I am unable to
identify elements as identical by comparing their URIs.
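
A possible workaround, as a rough sketch: compare the URIs after
decoding their percent-escapes. The helper same_uri and both sample
URIs are made up for illustration, and the approach has a known
weakness that is noted in the docstring:

```python
from urllib.parse import unquote


def same_uri(a, b):
    """Hypothetical helper: compare two URIs after decoding escapes.

    Caveat: this also decodes escaped reserved characters such as %2F,
    so it may treat as equal two URIs that a strict comparison would
    keep apart.
    """
    return unquote(a) == unquote(b)


# Made-up sample values: one escaped the way Python does it, one not.
from_python = "file:///home/%EC%8B%B8%EC%9D%B4/a%21b"
from_other_tool = "file:///home/싸이/a!b"

print(same_uri(from_python, from_other_tool))  # True
```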

Thanks,
Bruno


