[issue20271] urllib.parse.urlparse() accepts wrong URLs

Fri Mar 6 19:29:59 CET 2015

Demian Brecht added the comment:

I think some further consideration around this change is worthwhile:

Currently, urllib.parse.split* methods seem to do inconsistent validation around the data they're splitting. For example:

(None for an invalid port)
>>> parse.splitnport('example.com:foo')
('example.com', None)

Whereas other split* methods do not:

(Auth part should be URL-encoded)
>>> parse.splituser('u@ser:p@ssword@example.com:80')
('u@ser:p@ssword', 'example.com:80')

And others are just plain incorrect:

(:bad should be the port as defined by the ABNF 'authority = [ userinfo "@" ] host [ ":" port ]')
>>> parse.splitport('example.com:bad')
('example.com:bad', None)

However, none of these cases (currently) raise exceptions.

Looking at urllib.parse, its two main purposes are splitting and parsing. In my mind, splitting should be responsible solely for splitting input on RFC-defined delimiters. Parsing, on the other hand, should be responsible for both splitting as necessary and validating input. It may also make sense to add simple validation functions to the module to comply with the "batteries included" philosophy, but that's a topic for another issue.
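
To make the distinction concrete, here is a rough sketch of the layering I have in mind. The names split_hostport and parse_hostport are hypothetical and are not part of urllib.parse; the split step only cuts on the delimiter, and the parse step is the only place that validates and raises.

```python
def split_hostport(netloc):
    """Hypothetical helper: split on the last ':' only, no validation."""
    host, sep, port = netloc.rpartition(':')
    if not sep:
        # No ':' at all, so there is no port component.
        return netloc, None
    if ']' in port:
        # The last ':' was inside a bracketed IPv6 literal such as '[::1]'.
        return netloc, None
    return host, port


def parse_hostport(netloc):
    """Hypothetical helper: split first, then validate the port."""
    host, port = split_hostport(netloc)
    if port is None or port == '':
        return host, None
    if not all(c in '0123456789' for c in port):
        raise ValueError('Invalid port: %r' % port)
    return host, int(port)
```

Under this sketch, split_hostport('example.com:bad') would hand back ('example.com', 'bad') untouched, while parse_hostport('example.com:bad') would raise ValueError.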

My concern with the proposed patch is that it adds further inconsistency to split* methods:

Before patch:

>>> parse.urlsplit('http://[::1]spam:80')
SplitResult(scheme='http', netloc='[::1]spam:80', path='', query='', fragment='')

After patch:

>>> parse.urlsplit('http://[::1]spam:80')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Volumes/src/p/cpython/Lib/urllib/parse.py", line 350, in urlsplit
    netloc, url = _splitnetloc(url, 2)
  File "/Volumes/src/p/cpython/Lib/urllib/parse.py", line 324, in _splitnetloc
    raise ValueError('Invalid IPv6 URL')

(While the above examples still yield the same results and don't raise exceptions)

I do think that the validation should be done, and I agree that an exception should be raised, but it should be done at the urlparse level, not on split (in this case, due to the change to _splitnetloc).
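
As an illustration only, validation at the parse level could look something like the following. Both check_bracketed_host and strict_urlparse are hypothetical names, not an existing or proposed stdlib API, and the check covers just the bracketed-host case from the example above.

```python
from urllib.parse import urlsplit


def check_bracketed_host(netloc):
    """Hypothetical check: reject text after ']' that is not ':port'."""
    hostport = netloc.rpartition('@')[2]  # ignore any userinfo part
    if '[' not in hostport and ']' not in hostport:
        return  # not a bracketed host, nothing to check here
    if not hostport.startswith('[') or ']' not in hostport:
        raise ValueError('Invalid IPv6 URL')
    rest = hostport.partition(']')[2]
    if rest and not rest.startswith(':'):
        raise ValueError('Invalid IPv6 URL')


def strict_urlparse(url):
    """Hypothetical wrapper: split first, validate afterwards."""
    # Depending on the Python version, urlsplit may already raise
    # ValueError for some malformed bracketed hosts.
    parts = urlsplit(url)
    check_bracketed_host(parts.netloc)
    return parts
```

With this arrangement, strict_urlparse('http://[::1]spam:80') would raise ValueError, and the splitting code would not need to know anything about what counts as a valid host.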

----------
nosy: +demian.brecht

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20271>
_______________________________________