Output of HTML parsing

Fri Jun 15 09:57:55 EDT 2007

Hi, all,

I want to get the information of the professors (name,title) from the
following link:

"http://www.economics.utoronto.ca/index.php/index/person/faculty/"

Ideally, I'd like to have a output file where each line is one Prof,
including his name and title. In practice, I use the CSV module.

The following is my program:

--------------- Program
----------------------------------------------------

import urllib,re,csv

url = "http://www.economics.utoronto.ca/index.php/index/person/
faculty/"

sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()

namePattern = re.compile(r'class="name">(.*)</a>')
titlePattern = re.compile(r'</a>, (.*)\s*</td>')

name = namePattern.findall(htmlSource)
title_temp = titlePattern.findall(htmlSource)
title =[]
for item in title_temp:
    item_new=" ".join(item.split())                #Suppress the
spaces between 'title' and </td>
    title.extend([item_new])

output =[]
for i in range(len(name)):
    output.insert(i,[name[i],title[i]])            #Generate a list of
[name, title]

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(output)                           #output CSV file

-------------- End of Program
----------------------------------------------

My questions are:

1.The code above assume that each Prof has a tilte. If any one of them
does not, the name and title will be mismatched. How to program to
allow that title can be empty?

2.Is there any easier way to get the data I want other than using
list?

3.Should I close the opened csv file("professor.csv")? How to close
it?

Thanks!

Jackie