url question - extracting (2 types of) domains

Michael Bentley michael at jedimindworks.com
Wed May 16 03:32:43 EDT 2007


On May 15, 2007, at 9:04 PM, lazy wrote:

> Hi,
> I'm trying to extract two domain names from a URL. Let's call them
> full_domain and significant_domain (which is the homepage domain).
>
> Eg: url=http://en.wikipedia.org/wiki/IPod ,
> full_domain=en.wikipedia.org ,significant_domain=wikipedia.org
>
> Using urlsplit (from the urlparse module), I can get the
> full_domain, but I'm wondering how to get significant_domain. I can't
> simply count the number of dots, etc.
>
> Some domains may be like foo.bar.co.in (where significant_domain =
> bar.co.in).
> I have a list of around 40M URLs. It's OK if I fail in a few (< 1%) cases,
> although I agree that it's not clear how to measure this error rate;
> maybe just based on intuition.
>
> Does anybody have clues about existing URL parsers in Python that do this?
> Searching online didn't turn up much other than
> the urlparse/urllib modules.
>
> The worst case is to build a table of domain
> categories (like .com, .co.il, etc.), look for one of them as the suffix
> rather than counting dots, and extract the part back to the preceding dot,
> but I'm afraid that if I do this, I might miss some domain category.

The best way I know to get an *authoritative* answer is to start with
the full_domain and try a whois lookup. If it returns no records,
drop the leftmost label (everything up to and including the first dot)
and try again. Repeat until you get a good answer; that is the
significant_domain.
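
A minimal sketch of that loop, with the lookup itself left out: has_record is a hypothetical callable that you would supply (for example, a wrapper around whatever whois client you use) and that returns True when a record exists. The function names are made up for illustration.

```python
def candidate_domains(full_domain):
    """Yield full_domain, then each shorter form with the leftmost label dropped."""
    labels = full_domain.split(".")
    # Stop before the bare last label ("org", "in"), which is not a candidate.
    for i in range(len(labels) - 1):
        yield ".".join(labels[i:])


def significant_domain_by_lookup(full_domain, has_record):
    """Return the first candidate for which has_record(candidate) is true.

    has_record is an assumption: any callable taking a domain string and
    returning True when a lookup (e.g. whois) finds a record for it.
    """
    for candidate in candidate_domains(full_domain):
        if has_record(candidate):
            return candidate
    return None  # nothing matched


# Stand-in for a real lookup, so the loop can be tried offline.
fake_registry = {"wikipedia.org", "bar.co.in"}
print(significant_domain_by_lookup("en.wikipedia.org", fake_registry.__contains__))
print(significant_domain_by_lookup("foo.bar.co.in", fake_registry.__contains__))
```

With 40M URLs you would probably want to remember the answer for each full_domain you have already resolved, since many URLs share a host and real lookups are slow.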

hth,
Michael



