url question - extracting (2 types of) domains

lazy arunmail at gmail.com
Wed May 16 05:05:00 EDT 2007


Thanks.
Hmm, the URL list is quite large (40M), so I think whois lookups will
take a long time. But yes, that seems to be a good way. I will probably
try it with a smaller set (10K) and see how long it takes.
If that turns out to be too slow, I will just build a table of known
domain suffixes (.com, .org, .co.il, etc.) so that I can find the
root domain (significant_domain) at least for those, and I hope the
majority of URLs fall into this group :)
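
A rough sketch of the table idea, assuming Python 3 (where urlsplit
lives in urllib.parse rather than urlparse). KNOWN_SUFFIXES and both
function names are made up for illustration, and the table here is far
from complete:

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete table of known domain suffixes.
KNOWN_SUFFIXES = {"com", "org", "net", "co.il", "co.in", "co.uk"}


def full_domain(url):
    """Return the host part of the URL, lowercased and without any port."""
    return (urlsplit(url).hostname or "").lower()


def significant_domain(url):
    """Return the known suffix plus the one label before it, else None."""
    labels = full_domain(url).split(".")
    # Check the longest candidate suffix first, so "co.in" is matched
    # before a shorter suffix would be.
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            return ".".join(labels[i - 1:])
    # Suffix not in the table: report a miss instead of guessing.
    return None


print(full_domain("http://en.wikipedia.org/wiki/IPod"))         # en.wikipedia.org
print(significant_domain("http://en.wikipedia.org/wiki/IPod"))  # wikipedia.org
print(significant_domain("http://foo.bar.co.in/"))              # bar.co.in
```

URLs whose suffix is missing from the table come back as None, so
counting those gives at least a rough idea of how much the table misses.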


On May 16, 12:32 am, Michael Bentley <mich... at jedimindworks.com>
wrote:
> On May 15, 2007, at 9:04 PM, lazy wrote:
>
>
>
> > Hi,
> > I'm trying to extract the domain name from a URL. Let's say I call
> > the two forms full_domain and significant_domain (the homepage domain).
>
> > E.g. url=http://en.wikipedia.org/wiki/IPod,
> > full_domain=en.wikipedia.org, significant_domain=wikipedia.org
>
> > Using urlsplit (from the urlparse module), I can get the
> > full_domain, but I'm wondering how to get significant_domain. I can't
> > simply count the number of dots, etc.
>
> > Some domains may be like foo.bar.co.in (where significant_domain =
> > bar.co.in).
> > I have a list of around 40M URLs. It's OK if I fail in a few (< 1%) cases,
> > although I agree that it is not clear how to measure this error rate;
> > maybe just based on intuition.
>
> > Does anybody have clues about existing URL parsers in Python that do this?
> > Searching online didn't turn up much other than
> > the urlparse/urllib modules.
>
> > The worst case is to build a table of domain
> > categories (like .com, .co.il, etc.), look for one of them as the suffix
> > rather than counting dots, and extract just the part back to the
> > preceding dot, but I'm afraid that if I do this, I might miss some
> > domain category.
>
> The best way I know to get an *authoritative* answer is to start with
> the full_domain and try a whois lookup.  If it returns no records,
> drop everything before the first dot and try again.  Repeat until you
> get a good answer -- this is the significant_domain.
>
> hth,
> Michael
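
The loop Michael describes could be sketched roughly as below, assuming
Python 3. The has_record argument is a hypothetical callable standing in
for a real whois query (no particular whois library is assumed), and
stopping before a bare one-label name is my own choice, not part of his
description:

```python
def significant_domain_by_lookup(full_domain, has_record):
    """Strip leading labels until has_record(candidate) is true.

    has_record is a hypothetical callable wrapping a whois lookup; it
    takes a domain name and returns True if a record exists for it.
    """
    labels = full_domain.lower().split(".")
    # Stop before querying a bare one-label name such as "org".
    while len(labels) >= 2:
        candidate = ".".join(labels)
        if has_record(candidate):
            return candidate
        # Drop everything before the first dot and try again.
        labels = labels[1:]
    return None


# Stand-in for whois: a fixed set of names that "have records".
registered = {"wikipedia.org", "bar.co.in"}
print(significant_domain_by_lookup("en.wikipedia.org", registered.__contains__))
print(significant_domain_by_lookup("foo.bar.co.in", registered.__contains__))
```

With 40M URLs, many of them will share a full_domain, so remembering the
answer per full_domain would avoid repeating the same lookups.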
