URL parsing for the hard cases

Mon Jul 23 16:34:20 EDT 2007

On 7/23/07, Miles wrote:
> On 7/22/07, John Nagle wrote:
> > Is there any library function that correctly tests for an IP address vs. a
> > domain name based on syntax, i.e. without looking it up in DNS?
>
> import re, string
>
> NETLOC_RE = re.compile(r'''^ #    start of string
>     (?:([^@]+)@)?            # 1: optional userinfo
>     (?:\[([0-9a-fA-F:]+)\]|  # 2: IPv6 addr
>     ([^\[\]:]+))             # 3: IPv4 addr or reg-name
>     (?::(\d+))?              # 4: optional port
> $''', re.VERBOSE)            #    end of string
>
> def normalize_IPv4(netloc):
>     try: # Assume it's an IP; if it's not, catch the error and return None
>         host = NETLOC_RE.match(netloc).group(3)
>         octets = [string.atoi(o, 0) for o in host.split('.')]
>         assert len(octets) <= 4
>         for i in range(len(octets), 4):
>             octets[i-1:] = divmod(octets[i-1], 256**(4-i))
>         for o in octets: assert 0 <= o < 256
>         host = '.'.join(str(o) for o in octets)
>     except (AssertionError, ValueError, AttributeError): return None
>     return host
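
To make the short-form expansion in the quoted loop concrete, here is a small sketch of just that step, written for Python 3. The helper name expand_octets is made up for illustration, and it assumes the parts have already been converted to integers (e.g. with int(o, 0)).

```python
def expand_octets(parts):
    """Expand 1-4 numeric parts into four octets (hypothetical helper)."""
    octets = list(parts)
    for i in range(len(octets), 4):
        # The last part still covers (5 - i) bytes; split off its top byte
        # and leave the remainder as the new last part.
        octets[i-1:] = divmod(octets[i-1], 256 ** (4 - i))
    return octets

print(expand_octets([127, 1]))        # [127, 0, 0, 1]
print(expand_octets([192, 168, 257])) # [192, 168, 1, 1]
print(expand_octets([2130706433]))    # [127, 0, 0, 1]
```

A part that is too large for its position leaves an octet of 256 or more, which is why the range check on each octet has to come after the expansion.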

Apparently socket.inet_aton accepts the same shorthand forms, so this will generally work as well:

import re, socket

NETLOC_RE = ...  # same pattern as above

def normalize_IPv4(netloc):
    try:
        host = NETLOC_RE.match(netloc).group(3)
        return socket.inet_ntoa(socket.inet_aton(host))
    except (AttributeError, socket.error):
        return None

Thanks to http://mail.python.org/pipermail/python-list/2007-July/450317.html
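
For anyone who wants to try the inet_aton round trip on its own, here is a minimal sketch that skips the netloc regex and takes a bare host string. The name normalize_host is hypothetical, and the exact set of forms accepted depends on the platform's C library, so treat the results as typical rather than guaranteed.

```python
import socket

def normalize_host(host):
    """Return dotted-quad form if host parses as IPv4, else None (sketch)."""
    try:
        # inet_aton packs the address into 4 bytes; inet_ntoa prints it
        # back in canonical dotted-decimal form.
        return socket.inet_ntoa(socket.inet_aton(host))
    except OSError:  # socket.error is an alias of OSError in Python 3
        return None

for host in ("127.0.0.1", "127.1", "example.com"):
    print(host, "->", normalize_host(host))
```

Note that some C libraries are lenient about trailing characters after the address, so it is probably still worth matching the host with the regex first.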

-Miles