[Tutor] search engine

Bob Gailer bgailer at alum.rpi.edu
Sat Jun 24 21:33:23 CEST 2006


vinodh kumar wrote:
> Hi all,
> I am a student in the computer science department. I have planned to
> design a search engine in Python. I am seeking info about how to
> proceed further.
> I need some example source code.
That is an ambitious project. I wonder whether this is "homework". (It 
sounds too ambitious to be homework, but one never knows.) We don't 
provide code for homework but are glad to assist you when you get stuck.

Before coding I suggest you create a design or plan for the program. Do 
you want to emulate Google? (Do you understand what Google does?) Or 
something simpler? (I suggest simpler).

What are you searching for? How much information do you want to store? 
How do you want to present the results to a user?

Python provides a urllib2 module for getting the contents of a web page. 
This example gets the python.org main page and displays the first 100 
bytes of it:

>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> print f.read(100)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<?xml-stylesheet href="./css/ht2html

That is the basic tool you'd use to get page contents.
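
For anyone following along on Python 3, where urllib2 was folded into 
urllib.request, the same idea might look roughly like the sketch below. 
The function name fetch_head and the timeout value are my own choices 
for illustration, and decoding as UTF-8 is an assumption about the page:

```python
from urllib.request import urlopen


def fetch_head(url, nbytes=100, timeout=10):
    """Return the first nbytes of the resource at url, decoded as text.

    fetch_head is a hypothetical helper name, not part of any library.
    """
    with urlopen(url, timeout=timeout) as response:
        raw = response.read(nbytes)
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    return raw.decode("utf-8", errors="replace")


# Example call (needs network access):
# print(fetch_head("http://www.python.org/"))
```

A real crawler would also want to handle errors such as timeouts and 
missing pages, which this sketch leaves out.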

BeautifulSoup http://www.crummy.com/software/BeautifulSoup/ is a really 
good tool for parsing the page contents, looking for text and links.
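
To show what "looking for text and links" means, here is a rough sketch 
that uses the standard library's html.parser instead of BeautifulSoup, 
only so that it runs without installing anything. BeautifulSoup is more 
forgiving of messy HTML. The names LinkAndTextParser and parse_page are 
made up for this example:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collect absolute link targets and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (list_of_links, page_text) for an HTML string."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)
```

Calling parse_page(html, url) on a fetched page gives you the links to 
follow next and the text to scan for keywords.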

I think those are the main ingredients of a search engine. The rest is 
various strategies for finding web sites from which to read pages.

I suggest you expand the above to a program that will read a given page, 
find the links to other pages and read them recursively. Then you need a 
way to look for the keywords of interest in the page text and store them 
with references to the pages that contain them. Python dictionaries are 
a natural way to collect this data, and the shelve module provides a way 
to save Python objects such as dictionaries for later retrieval.
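
The plan above could be sketched along these lines. This is only one 
possible shape, not a finished design. It assumes you supply your own 
fetch_page function (a hypothetical callable that takes a URL and 
returns the page text plus a list of links). It uses a queue rather than 
literal recursion so that long link chains do not hit Python's recursion 
limit, remembers the URLs it has seen so it does not loop forever, and 
indexes every word rather than a chosen list of keywords. All function 
names and the max_pages limit are made up for the example:

```python
import re
import shelve
from collections import deque


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crawl_and_index(start_url, fetch_page, max_pages=20):
    """Breadth-first crawl that builds a dict mapping word -> set of URLs.

    fetch_page is a hypothetical callable: url -> (text, list_of_links).
    """
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    pages_read = 0
    while queue and pages_read < max_pages:
        url = queue.popleft()
        try:
            text, links = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load or parse
        pages_read += 1
        for word in tokenize(text):
            index.setdefault(word, set()).add(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def save_index(index, path):
    """Store the index with shelve, one entry per word."""
    with shelve.open(path) as db:
        for word, urls in index.items():
            db[word] = sorted(urls)


def search(path, word):
    """Return the sorted list of URLs whose text contained the word."""
    with shelve.open(path) as db:
        return db.get(word.lower(), [])
```

After save_index(index, "myindex"), a later run of the program can call 
search("myindex", "python") without crawling again. Passing fetch_page 
in as an argument also lets you try the logic on a few made-up pages 
before pointing it at real web sites.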

Hope this helps get you started. Someday your work may surpass Google.

-- 
Bob Gailer
510-978-4454


