URL parsing for the hard cases
John Nagle
nagle at animats.com
Sun Jul 22 22:14:10 EDT 2007
memracom at yahoo.com wrote:
> Once you eliminate IPv6 addresses, parsing is simple. Is there a
> colon? Then there is a port number. Does the left over have any
> characters not in [0123456789.]? Then it is a name, not an IPv4
> address.
>
> --Michael Dillon
>
You wish. Hex input of IP addresses is allowed:
http://0x525eedda
and
http://0x52.0x5e.0xed.0xda
are both "Python.org". Or just put
0x52.0x5e.0xed.0xda
into the address bar of a browser. All of these work in Firefox on Windows and
are recognized as valid IP addresses.
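To see that the two hex forms name the same address, here is a quick check. It is only a sketch; the variable names are mine, and it uses Python's standard ipaddress module purely to do the arithmetic, not to parse URLs.

```python
import ipaddress

# Single 32-bit hex form, as in http://0x525eedda
whole = ipaddress.IPv4Address(0x525eedda)

# Dotted hex form: convert each term separately, then rebuild dotted decimal
parts = [int(term, 16) for term in "0x52.0x5e.0xed.0xda".split(".")]
dotted = ipaddress.IPv4Address(".".join(str(p) for p in parts))

print(whole, dotted, whole == dotted)  # 82.94.237.218 82.94.237.218 True
```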
On the other hand,
0x52.com
is a valid domain name, in use by PairNIC.
But
http://test.0xda
is handled by Firefox on Windows as a domain name. It doesn't resolve, but it's
sent to DNS.
So I think the question is whether every term between dots can be parsed as
a decimal or hex number. If every term parses as a number, and there are
no more than four of them, it's an IP address. Otherwise it's a domain name.
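A minimal sketch of that rule, purely syntactic and with no DNS lookup. The function name looks_like_ip is made up for illustration, and the sketch assumes a "number" means decimal digits or 0x-prefixed hex. It does not range-check the terms or give octal any special treatment, so it is a guess at the rule, not a statement of what any browser actually does.

```python
import re

# Assumption: a term counts as a number if it is all decimal digits
# or a 0x/0X prefix followed by at least one hex digit.
_NUMERIC_TERM = re.compile(r"0[xX][0-9a-fA-F]+|[0-9]+")

def looks_like_ip(host):
    """Return True if host is 1 to 4 dot-separated terms, all numeric."""
    terms = host.split(".")
    if not 1 <= len(terms) <= 4:
        return False
    # fullmatch rejects terms with any non-numeric characters, and empty terms
    return all(_NUMERIC_TERM.fullmatch(term) for term in terms)

for host in ("0x525eedda", "0x52.0x5e.0xed.0xda", "0x52.com", "test.0xda"):
    print(host, "->", "IP address" if looks_like_ip(host) else "domain name")
```

With the examples above, the two hex forms come out as IP addresses, while 0x52.com and test.0xda come out as domain names.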
There are phishing sites that pull stuff like this, and I'm parsing a long list
of such sites. So I really do need to get the hard cases right.
Is there any library function that correctly tests for an IP address vs. a
domain name based on syntax, i.e. without looking it up in DNS?
John Nagle