[New-bugs-announce] [issue22891] [patch]: code removal from urllib.parse.urlsplit()
Alexander Todorov
report at bugs.python.org
Mon Nov 17 11:42:24 CET 2014
New submission from Alexander Todorov:
In the urllib.parse (or urlparse on Python 2.X) module there is this function:
157 def urlsplit(url, scheme='', allow_fragments=True):
158 """Parse a URL into 5 components:
159 <scheme>://<netloc>/<path>?<query>#<fragment>
160 Return a 5-tuple: (scheme, netloc, path, query, fragment).
161 Note that we don't break the components up in smaller bits
162 (e.g. netloc is a single string) and we don't expand % escapes."""
163 allow_fragments = bool(allow_fragments)
164 key = url, scheme, allow_fragments, type(url), type(scheme)
165 cached = _parse_cache.get(key, None)
166 if cached:
167 return cached
168 if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
169 clear_cache()
170 netloc = query = fragment = ''
171 i = url.find(':')
172 if i > 0:
173 if url[:i] == 'http': # optimize the common case
174 scheme = url[:i].lower()
175 url = url[i+1:]
176 if url[:2] == '//':
177 netloc, url = _splitnetloc(url, 2)
178 if allow_fragments and '#' in url:
179 url, fragment = url.split('#', 1)
180 if '?' in url:
181 url, query = url.split('?', 1)
182 v = SplitResult(scheme, netloc, url, query, fragment)
183 _parse_cache[key] = v
184 return v
185 for c in url[:i]:
186 if c not in scheme_chars:
187 break
188 else:
189 scheme, url = url[:i].lower(), url[i+1:]
190
191 if url[:2] == '//':
192 netloc, url = _splitnetloc(url, 2)
193 if allow_fragments and '#' in url:
194 url, fragment = url.split('#', 1)
195 if '?' in url:
196 url, query = url.split('?', 1)
197 v = SplitResult(scheme, netloc, url, query, fragment)
198 _parse_cache[key] = v
199 return v
There is an issue here (or a few of them) as follows:
* if url[:1] is already lowercase (equals "http") (line 173) then .lower() on line 174 is reduntant:
174 scheme = url[:i].lower() # <--- no need for .lower() b/c value is "http"
* OTOH line 173 could refactor the condition and match URLs where the scheme is uppercase. For example
173 if url[:i].lower() == 'http': # optimize the common case
* The code as is returns the same results (as far as I've tested it) for both:
urlsplit("http://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTTP://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTtP://github.com/atodorov/repo.git?param=value#myfragment")
but the last 2 invocations also go through lines 185-199
* Lines 174-184 are essentially the same as lines 189-199. The only optimization I can see is avoiding the for loop around lines 185-187 which checks for valid characters in the URL scheme and executes only a few loops b/c scheme names are quite short usually.
My personal vote goes for removal of lines 173-184.
Version-Release number of selected component (if applicable):
This is present in both Python 3 and Python 2 on all versions I have access to:
python3-libs-3.4.1-16.fc21.x86_64.rpm
python-libs-2.7.8-5.fc21.x86_64.rpm
python-libs-2.7.5-16.el7.x86_64.rpm
python-libs-2.6.6-52.el6.x86_64
Versions are from Fedora Rawhide and RHEL. Also the same code is present in the Mercurial repository.
Bug first reported as
https://bugzilla.redhat.com/show_bug.cgi?id=1160603
and now filing here for upstream consideration.
----------
components: Library (Lib)
files: urlparse_refactor.patch
keywords: patch
messages: 231278
nosy: Alexander.Todorov
priority: normal
severity: normal
status: open
title: [patch]: code removal from urllib.parse.urlsplit()
type: enhancement
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6
Added file: http://bugs.python.org/file37212/urlparse_refactor.patch
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue22891>
_______________________________________
More information about the New-bugs-announce
mailing list