URL parsing for the hard cases
Miles
semanticist at gmail.com
Mon Jul 23 16:34:20 EDT 2007
On 7/23/07, Miles wrote:
> On 7/22/07, John Nagle wrote:
> > Is there any library function that correctly tests for an IP address vs. a
> > domain name based on syntax, i.e. without looking it up in DNS?
>
> import re, string
>
> NETLOC_RE = re.compile(r'''^          # start of string
>     (?:([^@]+)@)?                      # 1: optional userinfo
>     (?:\[([0-9a-fA-F:]+)\]|            # 2: IPv6 addr
>     ([^\[\]:]+))                       # 3: IPv4 addr or reg-name
>     (?::(\d+))?                        # 4: optional port
>     $''', re.VERBOSE)                  # end of string
>
> def normalize_IPv4(netloc):
>     try:  # Assume it's an IP; if it's not, catch the error and return None
>         host = NETLOC_RE.match(netloc).group(3)
>         octets = [string.atoi(o, 0) for o in host.split('.')]
>         assert len(octets) <= 4
>         # Expand a short form such as 127.1: the last part fills the remaining bytes
>         for i in range(len(octets), 4):
>             octets[i-1:] = divmod(octets[i-1], 256**(4-i))
>         for o in octets: assert o < 256
>         host = '.'.join(str(o) for o in octets)
>     except (AssertionError, ValueError, AttributeError): return None
>     return host
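
For anyone running this on Python 3: string.atoi no longer exists there, and int(s, 0) rejects C-style octal such as "010". Below is a rough sketch of the same short-form expansion without either. It takes the bare host rather than the whole netloc, and the names parse_part and normalize_ipv4_host are made up for illustration. It is not a drop-in replacement for the function above.

```python
import re

# C-style numeric part: 0x.. is hex, a leading 0 is octal, otherwise decimal.
_PART_RE = re.compile(r"0[xX][0-9a-fA-F]+|0[0-7]*|[1-9][0-9]*")


def parse_part(text):
    """Parse one dot-separated part; raise ValueError if it is not numeric."""
    if not _PART_RE.fullmatch(text):
        raise ValueError("not a numeric address part: %r" % text)
    if text[:2].lower() == "0x":
        return int(text, 16)
    if len(text) > 1 and text[0] == "0":
        return int(text, 8)
    return int(text, 10)


def normalize_ipv4_host(host):
    """Return host as a dotted quad, or None if it is not IPv4 syntax."""
    try:
        parts = [parse_part(p) for p in host.split(".")]
    except ValueError:
        return None
    if len(parts) > 4:
        return None
    *leading, last = parts
    # Each leading part is one byte; the last part fills the remaining bytes.
    if any(p > 255 for p in leading) or last >= 256 ** (4 - len(leading)):
        return None
    value = last
    for i, p in enumerate(leading):
        value |= p << (8 * (3 - i))
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))
```

A name such as "example.com" fails the numeric check and comes back as None, so the None result doubles as the "this is a reg-name" signal.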
Apparently this will generally work as well:
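
import re, socket

NETLOC_RE = ...  # same pattern as in the quoted message above

def normalize_IPv4(netloc):
    try:
        host = NETLOC_RE.match(netloc).group(3)
        return socket.inet_ntoa(socket.inet_aton(host))
    except (AttributeError, TypeError, socket.error):
        # AttributeError: the netloc did not match at all
        # TypeError: group(3) is None for a bracketed IPv6 netloc
        # socket.error: the host is not an IPv4 address
        return None
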
Thanks to http://mail.python.org/pipermail/python-list/2007-July/450317.html
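
To get back to the original question (IP address vs. domain name by syntax alone), the same regex and the socket functions can be combined into a small classifier. This is only a sketch for Python 3. The name classify_host and its return strings are invented here. It also assumes socket.inet_pton is available on the platform, and what inet_aton accepts depends on the underlying C library.

```python
import re
import socket

NETLOC_RE = re.compile(r'''^            # start of string
    (?:([^@]+)@)?                        # 1: optional userinfo
    (?:\[([0-9a-fA-F:]+)\]|              # 2: IPv6 addr
    ([^\[\]:]+))                         # 3: IPv4 addr or reg-name
    (?::(\d+))?                          # 4: optional port
    $''', re.VERBOSE)                    # end of string


def classify_host(netloc):
    """Return 'ipv4', 'ipv6', 'reg-name', or None if netloc does not parse."""
    match = NETLOC_RE.match(netloc)
    if match is None:
        return None
    ipv6, host = match.group(2), match.group(3)
    if ipv6 is not None:
        # Bracketed literal: let the socket module validate the address.
        try:
            socket.inet_pton(socket.AF_INET6, ipv6)
        except (OSError, ValueError):
            return None
        return "ipv6"
    try:
        socket.inet_aton(host)
    except (OSError, ValueError):
        # Not parseable as IPv4, so treat it as a registered name.
        return "reg-name"
    return "ipv4"


for sample in ("user@10.0.0.1:8080", "[::1]:80", "www.python.org:80"):
    print(sample, "->", classify_host(sample))
```

No DNS lookup happens anywhere in this; both socket calls only parse the string.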
-Miles