recycling internationalized garbage

Fredrik Lundh fredrik at pythonware.com
Wed Mar 8 09:33:55 EST 2006


"aaronwmail-usenet at yahoo.com" wrote:

> Question: what is a good strategy for taking an 8bit
> string of unknown encoding and recovering the largest
> amount of reasonable information from it (translated to
> utf8 if needed)?  The string might be in any of the
> myriad encodings that predate unicode.  Has anyone
> done this in Python already?  The output must be clean
> utf8 suitable for arbitrary xml parsers.

some alternatives:

braindead bruteforce:

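    try to do strict decoding as utf-8.  if you succeed, you have a utf-8
    string.  if not, assume iso-8859-1 (every byte value is valid in that
    encoding, so the fallback cannot fail, though it may guess wrong).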
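
a minimal sketch of the bruteforce approach above, written for Python 3 bytes objects. the function name `to_utf8` is made up for illustration, and it only re-encodes: it does not strip control characters that XML 1.0 forbids, which would need a separate pass.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, guessing between UTF-8 and ISO-8859-1."""
    try:
        # strict decoding raises UnicodeDecodeError on any invalid sequence
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte value to a code point, so this never raises
        text = raw.decode("iso-8859-1")
    return text.encode("utf-8")
```

input that is already valid utf-8 comes back unchanged; anything else is treated as iso-8859-1, which is only a guess for data that was really in some other 8-bit encoding.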

slightly smarter bruteforce:

    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/163743

more advanced (but possibly not good enough for very short texts):

    http://chardet.feedparser.org/
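
a hedged sketch of how the detector could be combined with the bruteforce fallback. it assumes the detector is available as the third-party `chardet` package and uses its `chardet.detect()` call; the helper name `guess_decode` is hypothetical. if the package is missing, or the guess is absent or unusable, it falls back to utf-8 and then iso-8859-1.

```python
def guess_decode(raw: bytes) -> str:
    """Decode raw using a detected encoding, falling back to utf-8 / iso-8859-1."""
    try:
        import chardet  # third-party; assumed to be installed
    except ImportError:
        chardet = None

    encoding = None
    if chardet is not None:
        # detect() returns a dict; "encoding" may be None for undecidable input
        encoding = chardet.detect(raw)["encoding"]

    try:
        return raw.decode(encoding or "utf-8")
    except (UnicodeDecodeError, LookupError):
        # wrong guess or unknown codec name: iso-8859-1 accepts any bytes
        return raw.decode("iso-8859-1")
```

calling `guess_decode(data).encode("utf-8")` then gives utf-8 bytes. as noted above, the detection is statistical, so short inputs may be misidentified.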

</F> 





