Output of HTML parsing

Jackie jackie.BPUG at gmail.com
Fri Jun 15 09:57:55 EDT 2007


Hi, all,

I want to get information about the professors (name, title) from the
following link:

"http://www.economics.utoronto.ca/index.php/index/person/faculty/"

Ideally, I'd like to have an output file where each line is one professor,
including his or her name and title. In practice, I use the csv module.

The following is my program:


--------------- Program ---------------

import urllib,re,csv

url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()

namePattern = re.compile(r'class="name">(.*)</a>')
titlePattern = re.compile(r'</a>, (.*)\s*</td>')

name = namePattern.findall(htmlSource)
title_temp = titlePattern.findall(htmlSource)
title =[]
for item in title_temp:
    # Collapse the whitespace between the title and </td>
    item_new = " ".join(item.split())
    title.append(item_new)


output =[]
for i in range(len(name)):
    output.append([name[i], title[i]])  # Generate a list of [name, title]

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(output)                           #output CSV file

--------------- End of Program ---------------

My questions are:

1. The code above assumes that each professor has a title. If any one of
them does not, the names and titles will be mismatched. How can I allow
the title to be empty?

2. Is there an easier way to get the data I want other than using
lists?

3. Should I close the opened CSV file ("professor.csv")? If so, how do I
close it?
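
To make questions 1 and 3 concrete, here is a minimal sketch of one
possible direction, not a tested solution for the real page. It assumes
each entry looks like the hypothetical SAMPLE_HTML below (the actual markup
may differ), and it is written for Python 3 rather than the Python 2 used
in the program above. The names SAMPLE_HTML, ROW_PATTERN, extract_rows and
write_rows are made up for illustration. The idea is to capture the name
and an optional title in a single pattern so the two cannot get out of
step, and to let a "with" block close the CSV file.

```python
import csv
import re

# Hypothetical sample markup; the real page may be laid out differently.
SAMPLE_HTML = """
<td><a href="#" class="name">Alice Smith</a>, Professor   
</td>
<td><a href="#" class="name">Bob Jones</a></td>
"""

# One pattern captures the name and, when present, the title after it.
# The title part is an optional group, so an entry without a title
# still matches and simply yields an empty string for that group.
ROW_PATTERN = re.compile(r'class="name">(.*?)</a>(?:,\s*(.*?))?\s*</td>')


def extract_rows(html):
    """Return a list of [name, title]; title is '' when it is missing."""
    rows = []
    for name, title in ROW_PATTERN.findall(html):
        # Collapse runs of whitespace inside the title.
        rows.append([name.strip(), " ".join(title.split())])
    return rows


def write_rows(rows, path):
    """Write the rows to a CSV file; the with block closes the file."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


rows = extract_rows(SAMPLE_HTML)

if __name__ == "__main__":
    write_rows(rows, "professor.csv")
```

Because name and title come from the same match, a missing title gives an
empty field instead of shifting every later title up by one row.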

Thanks!

Jackie



