extracting from web pages but got disordered words sometimes

Paul McGuire ptmcg at austin.rr.com
Sat Jan 27 14:26:46 EST 2007


After looking at the pyparsing results, I think I see the problem with 
your original code.  You are selecting only the characters after the 
rightmost "-" character, but you really want everything to the right 
of the "- -" separator.  In some of the titles, the encoded Chinese 
itself includes a "-" character, so you are chopping off the part of 
the title that comes before that "-".

Try changing your code to:
    title = full_title.split("- -")[1]

I think then your original program will work.
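
To show the difference, here is a small sketch. The title string is made 
up for illustration (it is not one of your real pages), and I am only 
guessing that your original code did something like split("-")[-1].

```python
# Hypothetical title; the real titles come from the scraped pages.
full_title = "Example Site - - AB-CD"

# Assumed original approach: keep only what follows the rightmost "-".
# The "-" inside the title itself causes the front part to be lost.
after_last_dash = full_title.split("-")[-1]

# Suggested approach: split on the "- -" separator and keep the right side.
title = full_title.split("- -")[1]

print(repr(after_last_dash))
print(repr(title))
```

The second result keeps the leading space that followed the separator, so 
you may want to call .strip() on it before using it.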

-- Paul



