Extracting links!

Muhammad z_m1 at hotmail.com
Mon Feb 10 07:52:24 EST 2003


I'm building a search engine for our internal sites in Perl.
The engine uses an index (word -> URLs). I wrote a Perl script that
indexes all the pages on a site and started running it on our server,
but it was too heavy: it took a long time and sometimes bogged the
server down.
Now I'm changing the approach: I run the indexing script from a
regular Linux machine (not the server), and I work with URLs instead
of full file paths to open and index the pages.
I have a routine that extracts all the URLs from a given page using
the HTML::LinkExtor module, but I can't get it to descend through
every level of links and also pick up the links that those pages point
to; in other words, to work recursively and extract the URLs of all
the files on a site starting from its main page (index.html or
whatever).
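
(To be clear about what I mean by the index: roughly a hash from each
word to the set of URLs it appears on, something like this simplified
sketch. index_page and lookup_word are only illustrative names, and the
tokenization here is just a split on non-word characters.)

use strict;
use warnings;

my %index;   # word -> { url => 1, ... }

sub index_page {
    my ($url, $text) = @_;
    for my $word (map { lc } split /\W+/, $text) {
        next unless length $word;
        $index{$word}{$url} = 1;    # record that this word occurs on this page
    }
}

sub lookup_word {
    my ($word) = @_;
    return sort keys %{ $index{lc $word} || {} };
}

# index_page('http://www.oursite/index.html', $page_text);
# my @pages = lookup_word('perl');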


This is the routine:

use LWP::Simple;        # provides get()
use HTML::LinkExtor;

# %seen, @urls and $url are globals elsewhere in the script;
# $url holds the domain of the site being indexed.

sub extract_links {
    my ($myurl) = @_;

    my $parser = HTML::LinkExtor->new(undef, $myurl);
    $parser->parse(get($myurl))->eof;
    my @links = $parser->links;

    foreach my $linkarray (@links) {
        my @element  = @$linkarray;
        my $elt_type = shift @element;
        while (@element) {
            my ($attr_name, $attr_value) = splice(@element, 0, 2);

            # I want to pick up just the href attributes, not the src
            # attributes that can hold the URL of an image.
            # If it is an href, check that it isn't a javascript: URL
            # (e.g. href="javascript:...") and that it matches the
            # domain of the site we want to index.
            if ($attr_name eq "href" && $attr_value =~ /^http:/ &&
                $attr_value =~ /$url/) {
                #print "$attr_name : $attr_value\n";
                $seen{$attr_value}++;
            }
        }
    }

    for (sort keys %seen) {
        #&extract_links($_);
        push(@urls, $_);
        &extract_links($_) if ($myurl ne $_);
    }
    #print join("\n", @urls), "\n";
}

How can I make it do that?
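
For reference, this is the kind of loop I imagine I need (a rough,
untested sketch; crawl_site, %visited and @queue are just names I made
up, and it uses the URI module to keep the crawl on our own host):

use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::LinkExtor;
use URI;

# Crawl a whole site from its start page, follow only href links that
# stay on the same host, and never fetch the same URL twice.
sub crawl_site {
    my ($start_url) = @_;
    my $host = URI->new($start_url)->host;

    my %visited;
    my @queue = ($start_url);

    while (@queue) {
        my $page_url = shift @queue;
        next if $visited{$page_url}++;      # skip pages already fetched

        my $html = get($page_url);
        next unless defined $html;

        my $parser = HTML::LinkExtor->new(undef, $page_url);
        $parser->parse($html);
        $parser->eof;

        for my $link ($parser->links) {
            my ($tag, %attrs) = @$link;
            my $href = $attrs{href} or next;      # ignore src= and friends
            my $uri  = URI->new("$href");         # absolute, since a base was given
            next unless $uri->scheme && $uri->scheme eq 'http';
            next unless $uri->host   && $uri->host   eq $host;   # stay on our site
            push @queue, $uri->canonical->as_string;
        }
    }

    return sort keys %visited;
}

# my @all_urls = crawl_site('http://www.oursite/index.html');

Is a queue with a %visited hash like that the right way to keep it from
fetching the same page twice, or is there a better approach?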

Best regards,
Moudi.



