URL parsing for the hard cases
John Nagle
nagle at animats.com
Mon Jul 23 00:59:41 EDT 2007
Here's another hard case. This one might be a bug in urlparse:
import urlparse
s = 'ftp://administrator:password@64.105.135.30/originals/6 june 07/ebay/login/ebayisapi.html'
urlparse.urlparse(s)
yields:
(u'ftp', u'administrator:password@64.105.135.30', u'/originals/6 june 07/ebay/login/ebayisapi.html', '', '', '')
That second field is supposed to be the "hostport" (per the RFC usage
of the term; Python uses the term "netloc"), and the username/password
should have been parsed and moved to the "username" and "password" fields
of the object. So it looks like urlparse doesn't really understand FTP URLs.
That's a real URL, from a search for phishing sites. There are lots
of hostile URLs out there, some of which can fool some parsers.
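
For comparison, here is a minimal sketch of the same URL run through Python 3's urllib.parse, which is an assumption on my part since the session above uses the Python 2 urlparse module. The variable name parts is only for illustration. It shows which attributes of the result carry the user name, password, and host, alongside the raw netloc field.

```python
from urllib.parse import urlsplit

# The phishing URL from the message, with the wrapped line rejoined.
s = 'ftp://administrator:password@64.105.135.30/originals/6 june 07/ebay/login/ebayisapi.html'

parts = urlsplit(s)

# netloc is the raw authority component, so the userinfo stays in it.
print(parts.netloc)

# The split-out pieces are exposed as attributes of the result object.
print(parts.username)
print(parts.password)
print(parts.hostname)
print(parts.port)
```

In this sketch the tuple field still holds the full "user:password@host" string, and the separated values come from the attributes rather than from the tuple positions.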
John Nagle
John Nagle wrote:
> memracom at yahoo.com wrote:
>
>> Once you eliminate IPv6 addresses, parsing is simple. Is there a
>> colon? Then there is a port number. Does the left over have any
>> characters not in [0123456789.]? Then it is a name, not an IPv4
>> address.
>>
>> --Michael Dillon
>>
>
> You wish. Hex input of IP addresses is allowed:
>
> http://0x525eedda
>
> and
>
> http://0x52.0x5e.0xed.0xda
>
> are both "Python.org". Or just put
>
> 0x52.0x5e.0xed.0xda
>
> into the address bar of a browser. All these work in Firefox on Windows
> and are recognized as valid IP addresses.
>
> On the other hand,
>
> 0x52.com
>
> is a valid domain name, in use by PairNIC.
>
> But
>
> http://test.0xda
>
> is handled by Firefox on Windows as a domain name. It doesn't resolve,
> but it's sent to DNS.
>
> So I think the question is whether every term between dots can be parsed as
> a decimal or hex number. If all terms can be parsed as a number, and there
> are no more than four of them, it's an IP address. Otherwise it's a domain
> name.
>
> There are phishing sites that pull stuff like this, and I'm parsing a long
> list of such sites. So I really do need to get the hard cases right.
>
> Is there any library function that correctly tests for an IP address vs. a
> domain name based on syntax, i.e. without looking it up in DNS?
>
> John Nagle
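
The rule proposed in the quoted message (every dot-separated term parses as a decimal or hex number, and there are no more than four terms) can be sketched roughly as follows. The function names parse_term and looks_like_ip are hypothetical, not from any library. This is only the syntactic test as stated: it does no range checking on the values and does not handle octal or other forms that some browsers may accept.

```python
def parse_term(term):
    """Return the value of a decimal or 0x-prefixed hex term, or None."""
    t = term.lower()
    if t.startswith('0x'):
        digits, base, valid = t[2:], 16, '0123456789abcdef'
    else:
        digits, base, valid = t, 10, '0123456789'
    # Reject empty terms and anything containing a non-digit character.
    if not digits or any(c not in valid for c in digits):
        return None
    return int(digits, base)


def looks_like_ip(host):
    """Apply the rule: at most four terms, and every term is a number."""
    terms = host.split('.')
    if len(terms) > 4:
        return False
    return all(parse_term(t) is not None for t in terms)


# The hosts discussed in the quoted message.
for host in ['0x525eedda', '0x52.0x5e.0xed.0xda', '0x52.com', 'test.0xda']:
    print(host, looks_like_ip(host))
```

On the library question, socket.inet_aton defers to the platform's C library, which commonly accepts hex and short forms like these, so its behaviour may vary between systems and is worth checking before relying on it.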