URL parsing for the hard cases

John Nagle nagle at animats.com
Mon Jul 23 00:59:41 EDT 2007


Here's another hard case.  This one might be a bug in urlparse:

import urlparse

s = 'ftp://administrator:password@64.105.135.30/originals/6 june 07/ebay/login/ebayisapi.html'

urlparse.urlparse(s)

yields:

(u'ftp', u'administrator:password@64.105.135.30',
 u'/originals/6 june 07/ebay/login/ebayisapi.html', '', '', '')

That second field is supposed to be the "hostport" (per the RFC usage
of the term; Python uses the term "netloc"), and the username/password
should have been parsed and moved to the "username" and "password" fields
of the object. So it looks like urlparse doesn't really understand FTP URLs.
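
On the attribute names mentioned above, here is a minimal sketch, assuming
Python 3, where the urlparse module lives on as urllib.parse. There the
userinfo stays inside netloc, and the parsed pieces are exposed as separate
attributes on the result. Whether the 2007-era urlparse behaves identically
is not verified here.

```python
from urllib.parse import urlsplit  # Python 3 home of the old urlparse module

s = 'ftp://administrator:password@64.105.135.30/originals/6 june 07/ebay/login/ebayisapi.html'

parts = urlsplit(s)

# netloc is the whole authority component, userinfo included.
print(parts.netloc)    # administrator:password@64.105.135.30

# The individual pieces are available as attributes of the result.
print(parts.username)  # administrator
print(parts.password)  # password
print(parts.hostname)  # 64.105.135.30
print(parts.port)      # None (no port given in the URL)
print(parts.path)      # /originals/6 june 07/ebay/login/ebayisapi.html
```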

That's a real URL, from a search for phishing sites.  There are lots
of hostile URLs out there, some of which can fool some parsers.

				John Nagle

John Nagle wrote:
> memracom at yahoo.com wrote:
> 
>> Once you eliminate IPv6 addresses, parsing is simple. Is there a
>> colon? Then there is a port number. Does the left over have any
>> characters not in [0123456789.]? Then it is a name, not an IPv4
>> address.
>>
>> --Michael Dillon
>>
> 
>   You wish.  Hex input of IP addresses is allowed:
> 
>     http://0x525eedda
> 
> and
> 
>     http://0x52.0x5e.0xed.0xda
> 
> are both "Python.org".  Or just put
> 
>     0x52.0x5e.0xed.0xda
> 
> into the address bar of a browser.  All these work in Firefox on Windows
> and are recognized as valid IP addresses.
> 
> On the other hand,
>     
>     0x52.com
> 
> is a valid domain name, in use by PairNIC.
> 
> But
> 
>     http://test.0xda
> 
> is handled by Firefox on Windows as a domain name.  It doesn't resolve,
> but it's sent to DNS.
> 
> So I think the question is whether every term between dots can be parsed as
> a decimal or hex number.  If all terms can be parsed as a number, and
> there are no more than four of them, it's an IP address.  Otherwise it's a
> domain name.
> 
> There are phishing sites that pull stuff like this, and I'm parsing a
> long list of such sites.  So I really do need to get the hard cases right.
> 
> Is there any library function that correctly tests for an IP address vs. a
> domain name based on syntax, i.e. without looking it up in DNS?
> 
>                 John Nagle
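
The rule proposed in the quoted message (every dot-separated term parses as
a decimal or hex number, and there are at most four terms) can be sketched
as below. The name looks_like_ip_address is made up for illustration; it is
not a library function. It checks syntax only: no range checks on the
values, no special handling of octal, and no DNS lookup.

```python
import re

# One term: either 0x followed by hex digits, or plain decimal digits.
# (Hypothetical helper pattern; \Z anchors at the very end of the string.)
_NUMERIC_TERM = re.compile(r'(?:0[xX][0-9a-fA-F]+|[0-9]+)\Z')

def looks_like_ip_address(host):
    """Return True if host is 1 to 4 dot-separated decimal/hex numbers."""
    terms = host.split('.')
    if len(terms) > 4:
        return False
    # An empty term (empty host, or "1..2") never matches the pattern.
    return all(_NUMERIC_TERM.match(term) for term in terms)

# The cases from the message above:
print(looks_like_ip_address('0x525eedda'))           # True
print(looks_like_ip_address('0x52.0x5e.0xed.0xda'))  # True
print(looks_like_ip_address('0x52.com'))             # False
print(looks_like_ip_address('test.0xda'))            # False
```

A real classifier would probably also want to reject out-of-range values
and decide what to do about IPv6 literals, neither of which this sketch
attempts.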


