urlparse.urlparse bug - misparses long URL

Fri Dec 14 02:38:20 EST 2007

John Nagle wrote:
> Here's a hostile URL that "urlparse.urlparse" seems to have mis-parsed.
> ====
> http://www.midamericabank61.com.mx?email_from=gpatti@Tezzaron.com&xUDysvTbzZZOaymjQ2oYIx2AvMdJ1WQfjP02wIBBQBb1EVZAqmmGunxrcyGx1AcfegWUUYtaZfRW434O5Qn6InSMUZXgF5e3KzJbCntBGOj7pv31zab&action=login-run&passkey=e84239c9da59dbeb61d4d45db2cc5840&info_hash=%c9q%be%fe%c6j%ca%fd0%18%fe%23J%bd%89%d3%06L%fdV&info_hash=%18%9d%fb%15v%c0A%1f%c8%dds%0f%17%99%ceQ%83%a0%3e%27&info_hash=%df%f0%1c%5e%d75%b2%7d%e6D%0d%3e%d8%fbZ%5c%de%2ae%93&https://www.midamericabank.com/my_acccounts/default.aspxL0PWSjXev6xlkMTqVKFbLUgrh8CBquCchip4PuQDWYLYpzDGOFkLZyY
> ====
> What we get back in the "accesshost" field (i.e. the domain name) is
> 
> ====
> 'www.midamericabank61.com.mx?email_from=gpatti@Tezzaron.com&xUDysvTbzZZOaymjQ2oYIx2AvMdJ1WQfjP02wIBBQBb1EVZAqmmGunxrcyGx1AcfegWUUYtaZfRW434O5Qn6InSMUZXgF5e3KzJbCntBGOj7pv31zab&action=login-run&passkey=e84239c9da59dbeb61d4d45db2cc5840&info_hash=%c9q%be%fe%c6j%ca%fd0%18%fe%23J%bd%89%d3%06L%fdV&info_hash=%18%9d%fb%15v%c0A%1f%c8%dds%0f%17%99%ceQ%83%a0%3e%27&info_hash=%df%f0%1c%5e%d75%b2%7d%e6D%0d%3e%d8%fbZ%5c%de%2ae%93&https:'
> ====
> 
> which is wrong.  Something far out in that URL is breaking urlparse, and it's 
> not able to extract the domain name properly.
> 
> It's not a UNICODE issue; forced the data to "str" and it still mis-parses.
> 
> I'm trying to construct a shorter string that fails.  More to follow.
> 
> (Yes, another error associated with the wonderful world of parsing hostile sites 
> in Python.  This is from a phishing attack, and that URL is in PhishTank.)
> 
> 					John Nagle
> 					SiteTruth

urlparse ends the host part only at the first slash after the "//", not at
the "?".  Here that slash just happens to come very late in the URL, inside
the query string.

>>> from urlparse import urlparse
>>> urlparse('http://example.com?blahblah=http://example.net')
('http', 'example.com?blahblah=http:', '//example.net', '', '', '')
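For comparison, here is a minimal sketch of the behaviour one would expect, where the host part ends at the first "/", "?" or "#", whichever comes first. The helper name split_authority is hypothetical and only for illustration; it is not part of urlparse. The sketch assumes Python 3, where the module lives at urllib.parse, and the last line shows that the current standard library splits the short example the same way.

```python
from urllib.parse import urlsplit  # Python 3 location of the old urlparse module


def split_authority(url):
    """Hypothetical helper: return (scheme, authority, rest) for a URL.

    Assumes the authority ends at the first '/', '?' or '#'.
    """
    scheme, sep, remainder = url.partition("://")
    if not sep:
        # No "scheme://" prefix, so there is no authority to extract.
        return "", "", url
    end = len(remainder)
    for delim in "/?#":
        pos = remainder.find(delim)
        if pos != -1:
            end = min(end, pos)
    return scheme, remainder[:end], remainder[end:]


url = "http://example.com?blahblah=http://example.net"
print(split_authority(url))  # ('http', 'example.com', '?blahblah=http://example.net')
print(urlsplit(url).netloc)  # example.com
```

Cutting at the earliest of the three delimiters keeps a slash that appears later, inside the query string, from being mistaken for the end of the host.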