write xml to txt encoding

Stefan Behnel stefan_ml at behnel.de
Thu Jul 29 08:54:54 EDT 2010


William Johnston, 29.07.2010 14:12:
> I have a Python app that parses XML files and then writes to text files.

XML or HTML?


> However, the output text file is "sometimes" encoded in some Asian language.
>
> Here is my code:
>
>
> encoding = "iso-8859-1"
>
> clean_sent = nltk.clean_html(sent.text)
>
> clean_sent = clean_sent.encode(encoding, "ignore");
>
>
> I also tried "UTF-8" encoding, but received the same results.

What result?

Maybe the NLTK cannot determine the encoding of the HTML file (because the 
file is broken and/or doesn't correctly specify its own encoding) and thus 
fails to decode it?

Stefan




More information about the Python-list mailing list