extracting from web pages but got disordered words sometimes

Frank Potter could.net at gmail.com
Sat Jan 27 21:33:49 EST 2007


Thank you, I tried again and figured it out.
It has something to do with Beautiful Soup. I worked with it a year ago, also 
on Chinese HTML pages, and no errors occurred. I reread the 
old code and found the difference: convert the page to unicode before 
feeding it to Beautiful Soup, and everything is OK.
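
For anyone who finds this thread later, here is a rough sketch of what I mean by converting the page to unicode first. It is written in Python 3 style and is not my original code. The helper name to_unicode, the candidate encodings and the sample page are made up for illustration, and the bs4 import assumes the current Beautiful Soup package is installed (it is skipped if not).

```python
def to_unicode(raw, encodings=("utf-8", "gb18030")):
    """Decode raw page bytes to text, trying each candidate encoding in order.

    The encoding list is an assumption; use whatever your pages really use.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway and replace the bytes that do not fit.
    return raw.decode(encodings[0], errors="replace")


# Stand-in for the bytes downloaded from a GBK-encoded Chinese page.
raw_page = "<html><head><title>中文标题</title></head></html>".encode("gbk")

# Convert to unicode BEFORE handing the page to the parser.
page_text = to_unicode(raw_page)

try:
    from bs4 import BeautifulSoup  # optional third-party dependency

    soup = BeautifulSoup(page_text, "html.parser")
    title = soup.title.string
except ImportError:
    title = None  # Beautiful Soup not installed; the decoding step still ran
```

The point is only that the parser receives text that is already decoded, so it never has to guess the encoding of the raw bytes.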

On Jan 28, 3:26 am, "Paul McGuire" <p... at austin.rr.com> wrote:
> After looking at the pyparsing results, I think I see the problem with
> your original code.  You are selecting only the characters after the
> rightmost "-" character, but you really want to select everything to
> the right of "- -".  In some of the titles, the encoded Chinese
> includes a "-" character, so you are chopping off everything before
> that.
>
> Try changing your code to:
>     title=full_title.split("- -")[1]
>
> I think then your original program will work.
>
> -- Paul
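
To illustrate the difference Paul describes, here is a small sketch with a made-up title (the string in full_title is hypothetical, not taken from the real pages). The .strip() call is an extra I added to drop the leading space; it is not part of Paul's line.

```python
# Hypothetical title whose text part itself contains a "-" character.
full_title = "Example Site - - 中文-标题"

# Taking everything after the rightmost "-" cuts off the start of the title.
after_last_dash = full_title.split("-")[-1]

# Splitting on the "- -" separator keeps the whole title.
title = full_title.split("- -")[1].strip()

print(repr(after_last_dash))
print(repr(title))
```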



