URL parsing for the hard cases

Sun Jul 22 22:14:10 EDT 2007

memracom at yahoo.com wrote:

> Once you eliminate IPv6 addresses, parsing is simple. Is there a
> colon? Then there is a port number. Does the left over have any
> characters not in [0123456789.]? Then it is a name, not an IPv4
> address.
> 
> --Michael Dillon

   You wish.  Hex input of IP addresses is allowed:

	http://0x525eedda

and

	http://0x52.0x5e.0xed.0xda

both reach python.org (the hex works out to 82.94.237.218).  Or just put

	0x52.0x5e.0xed.0xda

into the address bar of a browser.  All these work in Firefox on Windows and
are recognized as valid IP addresses.
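
To check the arithmetic behind those two forms, here is a minimal Python
sketch.  The helper names are made up for illustration; nothing here is a
library API.  It only converts the hex notations to dotted decimal and says
nothing about how a browser decides what counts as an address.

```python
def to_dotted_quad(value):
    """Render a 32-bit integer as dotted-decimal IPv4 text."""
    # Take the four bytes from most significant to least significant.
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def hex_terms_to_dotted_quad(host):
    """Convert a host such as '0x52.0x5e.0xed.0xda' term by term."""
    # int(text, 16) accepts an optional '0x' prefix.
    return ".".join(str(int(term, 16)) for term in host.split("."))


single = to_dotted_quad(0x525eedda)
dotted = hex_terms_to_dotted_quad("0x52.0x5e.0xed.0xda")
```

Both `single` and `dotted` come out as the same dotted-decimal string, which
is why the two URLs above land on the same host.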

On the other hand,

	0x52.com

is a valid domain name, in use by PairNIC.

But

	http://test.0xda

is handled by Firefox on Windows as a domain name.  It doesn't resolve, but it's
sent to DNS.

So I think the test is whether every dot-separated term can be parsed as
a decimal or hex number.  If every term parses as a number, and there are
no more than four of them, it's an IP address.  Otherwise it's a domain name.
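
A rough Python sketch of that rule, under stated assumptions: the function
names are hypothetical, only decimal and 0x-prefixed hex terms are recognized,
no range checking is done on the values, and an empty term (such as the one a
trailing dot leaves behind) counts as "not a number".  It implements only the
rule as worded above and makes no claim to match any particular browser.

```python
import string

HEX_DIGITS = set(string.hexdigits)
DEC_DIGITS = set(string.digits)


def is_number(term):
    """True if term is all decimal digits, or '0x' followed by hex digits."""
    if term[:2].lower() == "0x":
        digits = term[2:]
        return len(digits) > 0 and set(digits) <= HEX_DIGITS
    return len(term) > 0 and set(term) <= DEC_DIGITS


def looks_like_ip(host):
    """Apply the rule: at most four terms, and every term is a number."""
    terms = host.split(".")
    return len(terms) <= 4 and all(is_number(t) for t in terms)


# The cases from this message, with the classification the rule gives them.
cases = {
    "0x525eedda": looks_like_ip("0x525eedda"),
    "0x52.0x5e.0xed.0xda": looks_like_ip("0x52.0x5e.0xed.0xda"),
    "0x52.com": looks_like_ip("0x52.com"),
    "test.0xda": looks_like_ip("test.0xda"),
}
```

With this rule the first two hosts are classified as IP addresses and the last
two as domain names, which matches what Firefox on Windows did in the cases
above.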

There are phishing sites that pull stuff like this, and I'm parsing a long list
of such sites.  So I really do need to get the hard cases right.

Is there any library function that correctly tests for an IP address vs. a
domain name based on syntax, i.e. without looking it up in DNS?

				John Nagle