extracting from web pages but got disordered words sometimes

Frank Potter could.net at gmail.com
Sat Jan 27 06:18:47 EST 2007


There are ten web pages I want to process, ranging
from http://www.af.shejis.com/new_lw/html/125926.shtml
to      http://www.af.shejis.com/new_lw/html/125936.shtml

Each of them declares the Chinese charset "gb2312", and Firefox 
displays all of them correctly, as readable Chinese.
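
The declared charset can also be checked from the raw bytes, independent 
of any browser. What follows is only a minimal sketch, assuming the page 
names its charset in a <meta> tag; declared_charset is a hypothetical 
helper name, and the sample bytes are made up for illustration rather 
than copied from the site.

```python
import re

# Hypothetical helper (not part of get_title.py): look for "charset=..."
# in the first few kilobytes of the raw, still undecoded HTML.
_CHARSET_RE = re.compile(br'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)',
                         re.IGNORECASE)

def declared_charset(raw_html):
    """Return the declared charset in lower case, or None if none is found."""
    match = _CHARSET_RE.search(raw_html[:4096])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()

# Made-up sample page, for illustration only.
sample = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=gb2312"><title>x</title></head></html>')
print(declared_charset(sample))
```

A declared charset is only a claim made by the page; the bytes that 
follow may still contain characters outside that charset.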

My job is to fetch every page, extract its HTML title, and 
display the title in a Linux shell terminal.

My problem is that for some pages I get a human-readable title (in 
Chinese), but for other pages I get garbled characters. Since every page 
declares the same charset, I don't know why I can't get every title in the 
same way.

Here's my Python code, get_title.py:

[CODE]
#!/usr/bin/python
import urllib2
from BeautifulSoup import BeautifulSoup

min_page=125926
max_page=125936

def make_page_url(page_index):
    return ur"".join([ur"http://www.af.shejis.com/new_lw/html/",
                      str(page_index),ur".shtml"])

def get_page_title(page_index):
    url=make_page_url(page_index)
    print "now getting: ", url
    user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers={'User-Agent':user_agent}
    req=urllib2.Request(url,None,headers)
    response=urllib2.urlopen(req)
    #print response.info()
    page=response.read()

    #extract title with BeautifulSoup
    soup=BeautifulSoup(page)
    full_title=str(soup.html.head.title.string)

    #title is in the format of "title --title"
    #use this code to delete the "--" and the duplicate title
    title=full_title[full_title.rfind('-')+1::]

    return title

for i in xrange(min_page,max_page):
    print get_page_title(i)
[/CODE]
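
For comparison, here is a minimal sketch of the title step with the 
decoding made explicit instead of left to the parser. It rests on an 
assumption that has not been confirmed for these pages: that some of 
them, although labelled "gb2312", contain characters from the larger 
GBK/GB18030 sets, so decoding as "gb18030" (a superset of both) might 
succeed where a strict "gb2312" decode does not. extract_title is a 
hypothetical name, the regular expression is a stand-in for 
BeautifulSoup, and the sample bytes are made up.

```python
import re

# Stand-in for the BeautifulSoup lookup: grab the raw bytes between the
# <title> tags without decoding the rest of the page.
_TITLE_RE = re.compile(br"<title[^>]*>(.*?)</title>",
                       re.IGNORECASE | re.DOTALL)

def extract_title(raw_html, encoding="gb18030"):
    """Decode the <title> bytes explicitly; return None if there is no title."""
    match = _TITLE_RE.search(raw_html)
    if match is None:
        return None
    # "replace" substitutes U+FFFD for undecodable bytes instead of raising,
    # so a single bad byte does not lose the whole title.
    full_title = match.group(1).decode(encoding, "replace")
    # The title has the form "title --title"; keep the text after the last "-".
    return full_title[full_title.rfind(u"-") + 1:].strip()

# Made-up sample: two Chinese characters repeated around "--", as GB2312 bytes.
sample = u"<html><head><title>\u6807\u9898 --\u6807\u9898</title></head></html>".encode("gb2312")
title = extract_title(sample)
print(title == u"\u6807\u9898")
```

The function returns a unicode string; it would still have to be encoded 
to the terminal's own encoding before printing, which is a separate 
place where garbling can occur.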

Will somebody please help me out? Thanks in advance.



