You may find Mozilla's charset autodetection method interesting. It is described in "A composite approach to language/encoding detection": http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

Perhaps it can be used from Python through PyXPCOM, but I don't know whether that works.
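
To give a rough feel for what encoding detection involves, here is a minimal sketch in plain Python. It is not Mozilla's algorithm and it does not use PyXPCOM. It only checks for a byte order mark and then tries a list of candidate encodings, keeping the first one that decodes without error. The function name guess_encoding and the candidate list are my own assumptions for illustration. As far as I understand it, the approach in the linked paper goes further by also using statistics on character frequencies.

```python
import codecs

# BOM signatures. UTF-32 comes before UTF-16 because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, candidates=("ascii", "utf-8", "shift_jis", "euc_jp")):
    """Hypothetical helper: return the first plausible encoding name, or None.

    Assumption: the order of `candidates` expresses preference, since many
    byte strings are valid in more than one encoding.
    """
    # Step 1: a byte order mark is the strongest hint available.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Step 2: keep the first candidate that decodes the bytes strictly.
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


if __name__ == "__main__":
    print(guess_encoding(b"plain text"))
    print(guess_encoding("caf\u00e9".encode("utf-8")))
```

A validity check like this can only rule encodings out. It cannot tell apart two encodings that both accept the same bytes, which is the problem a statistical detector tries to solve.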