Detect character encoding

Nemesis nemesis at nowhere.invalid
Sun Dec 4 14:45:56 EST 2005


While I was thinking of a nice intro, "Michal" wrote:

> Hello,
> is there any way to detect a string's encoding in Python?
> I need to process several files. Each of them could be encoded in a
> different charset (iso-8859-2, cp1250, etc). I want to detect it and
> encode it to utf-8 (with the string method encode).
> Thank you for any answer

Hi,
As you have already heard, you can't be sure, but you can guess.
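
To illustrate why it can only ever be a guess, here is a minimal sketch
(the byte value is just an example I picked): the same byte decodes
without error under two of the charsets mentioned in this thread, but to
different characters, so nothing in the data itself tells you which one
was meant.

```python
# One byte, two valid interpretations.
raw = b"\xe8"

as_latin1 = raw.decode("iso-8859-1")  # U+00E8, e with grave accent
as_latin2 = raw.decode("iso-8859-2")  # U+010D, c with caron

# Both decodes succeed, yet the results differ.
print(repr(as_latin1), repr(as_latin2))
```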

I use a method like this:

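    def guess_encoding(text):
        # 'text' is a byte string (str in Python 2, bytes in Python 3).
        # Return the first charset in guess_list that decodes it
        # without errors, or None if none of them does.
        for enc in guess_list:
            try:
                text.decode(enc, "strict")
            except (UnicodeError, LookupError):
                # Not valid in this charset, or charset name unknown.
                continue
            return enc
        return None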

'guess_list' is an ordered charset name list like this:

['us-ascii','iso-8859-1','iso-8859-2',...,'windows-1250','windows-1252'...]

Of course, you can remove any charsets you are sure you'll never encounter.
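
Putting it together with the conversion to utf-8 that was asked for, a
sketch might look like the following. The names GUESS_LIST and to_utf8
are hypothetical, and the candidate list is my own assumption (including
'utf-8' itself, which is not in the list above). Note that the order
matters: Python's iso-8859-1 and iso-8859-2 codecs accept every possible
byte, so anything listed after one of them is never reached. Put the
strict charsets first and the permissive ones last.

```python
# Hypothetical candidate list: strict charsets first, permissive ones last.
GUESS_LIST = ["us-ascii", "utf-8", "windows-1250", "iso-8859-2"]


def guess_encoding(data, candidates=GUESS_LIST):
    """Return the first charset that decodes data without error, else None."""
    for enc in candidates:
        try:
            data.decode(enc, "strict")
        except (UnicodeError, LookupError):
            # Invalid bytes for this charset, or unknown charset name.
            continue
        return enc
    return None


def to_utf8(data, candidates=GUESS_LIST):
    """Re-encode data as UTF-8 using the guessed charset."""
    enc = guess_encoding(data, candidates)
    if enc is None:
        raise ValueError("no candidate charset decodes this data")
    return data.decode(enc).encode("utf-8")
```

A successful decode only means the bytes are valid in that charset, not
that the charset is the right one, so it is still worth checking the
output by eye for files you care about.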
-- 
This could really be the spark that makes the drop overflow.
 
 |\ |       |HomePage   : http://nem01.altervista.org
 | \|emesis |XPN (my nr): http://xpn.altervista.org



