URL parsing for the hard cases

memracom at yahoo.com memracom at yahoo.com
Sun Jul 22 17:12:27 EDT 2007


On 22 Jul, 18:56, John Nagle <na... at animats.com> wrote:
>     Is there something available that will parse the "netloc" field as
> returned by URLparse, including all the hard cases?  The "netloc" field
> can potentially contain a port number and a numeric IP address.  The
> IP address may take many forms, including an IPv6 address.
>
>     I'm parsing URLs used by hostile sites, and the weird cases come up
> all too frequently.

I assume that by "netloc" you mean the second element of the tuple
returned by urlparse.urlparse(). In a conforming URL, an IPv6 address
in the netloc is wrapped in square brackets. The colons inside the []
belong to the IPv6 address, and the single possible colon outside the
brackets introduces the port number. Of course, you might want to help
people who do not follow the RFCs and fail to wrap the IPv6 address in
square brackets. In that case, try...except comes in handy. You can
try to parse an IPv6 address and, if that fails because of too many
segments, fall back to some other behaviour.
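
A rough sketch of that approach, not a definitive implementation. It
assumes Python 3, whose ipaddress module postdates this thread, and the
name split_netloc is made up for illustration. Bracketed addresses are
handled first; anything else goes through try...except and falls back
to a plain host[:port] split.

```python
import ipaddress

DIGITS = "0123456789"


def split_netloc(netloc):
    """Return (host, port); port is an int, or None when absent."""
    # Ignore any "user:password@" prefix in front of the host.
    hostport = netloc.rpartition("@")[2]
    if hostport.startswith("["):
        # RFC 3986 form: every colon inside the brackets is part of the address.
        host, bracket, rest = hostport[1:].partition("]")
        if not bracket:
            raise ValueError("missing ']' in %r" % netloc)
        ipaddress.IPv6Address(host)  # raises ValueError if malformed
        port = rest[1:] if rest.startswith(":") else None
        if rest and port is None:
            raise ValueError("junk after ']' in %r" % netloc)
    else:
        try:
            # Non-conforming input: a bare IPv6 address without brackets.
            ipaddress.IPv6Address(hostport)
            host, port = hostport, None
        except ValueError:
            # Not IPv6 after all: fall back to a plain host[:port] split.
            host, sep, port = hostport.rpartition(":")
            if not sep:
                host, port = hostport, None
    if not port:
        return host, None
    # Check digits by hand so odd Unicode digits and huge strings are rejected.
    if len(port) > 5 or any(c not in DIGITS for c in port):
        raise ValueError("bad port in %r" % netloc)
    return host, int(port)
```

Called as split_netloc("[2001:db8::1]:8080"), this separates the
address from the port; malformed input raises ValueError instead of
being guessed at.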

The worst case is a URL like http://2001::123:4567:abcd:8080/something.
Is the 8080 a port number or part of the IPv6 address? Both readings
are syntactically valid, because "::" can stand for any number of zero
groups. If I had to support non-bracketed IPv6 addresses, I would
interpret this as http://[2001::123:4567:abcd]:8080/something.
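
One way to encode that preference, as a sketch only. It assumes Python
3's ipaddress module, and guess_unbracketed is a hypothetical helper
name. A trailing all-decimal segment is read as a port whenever what
precedes it is still a valid IPv6 address; otherwise the whole string
is tried as an address.

```python
import ipaddress

DIGITS = "0123456789"


def guess_unbracketed(hostport):
    """Split a bracket-less IPv6 host[:port], preferring the port reading."""
    head, sep, tail = hostport.rpartition(":")
    looks_like_port = (
        bool(sep)
        and 0 < len(tail) <= 5
        and all(c in DIGITS for c in tail)
        and int(tail) <= 65535
    )
    if looks_like_port:
        try:
            ipaddress.IPv6Address(head)
            # The part before the last colon is a complete address,
            # so treat the last segment as the port.
            return head, int(tail)
        except ValueError:
            pass  # e.g. "::1", where the head alone is not an address
    # Otherwise the whole string must be an address; raises ValueError if not.
    ipaddress.IPv6Address(hostport)
    return hostport, None
```

This is a heuristic, not something RFC 3986 sanctions: an address whose
last group happens to be all decimal digits will be misread as
address plus port.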

RFC 3986 is the reference for correct URL formats.

Once you eliminate IPv6 addresses, parsing is simple. Is there a
colon? Then there is a port number. Does the remainder have any
characters not in [0123456789.]? Then it is a name, not an IPv4
address.
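
The two questions above can be sketched as follows. The function name
classify_hostport and the "ipv4"/"name" labels are made up for
illustration, and the input is assumed to have had any IPv6 form ruled
out already.

```python
DIGITS = "0123456789"


def classify_hostport(hostport):
    """Return (host, port, kind) for a netloc known not to hold IPv6."""
    # Is there a colon? Then whatever follows it is the port.
    host, sep, port = hostport.partition(":")
    if not sep or not port:
        port_number = None
    elif len(port) <= 5 and all(c in DIGITS for c in port):
        port_number = int(port)
    else:
        raise ValueError("bad port in %r" % hostport)
    # Any character outside digits and dots means a name, not IPv4.
    if host and all(c in DIGITS + "." for c in host):
        kind = "ipv4"
    else:
        kind = "name"
    return host, port_number, kind
```

Note that "ipv4" here only means the character test passed. Strings
such as "1.2.3" or "999.1.1.1" also get through, so hostile input still
needs a stricter check on the dotted quad afterwards.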

--Michael Dillon



