Codecs

John Machin sjmachin at lexicon.net
Sun Jul 10 20:29:40 EDT 2005


Ivan Van Laningham wrote:
> 
> It seems to me that if I want to try to read an unknown file
> using an exhaustive list of possible encodings ...


Supposing such a list existed:

What do you mean by "unknown file"? That the encoding is unknown?

Possibility 1:
You are going to try to decode the file from a "legacy" encoding to
Unicode, stopping at the first 'success' (defined how?)? But the file
could be decoded by *several* codecs into Unicode without an exception
being raised. Just a simple example: each of the encodings
['iso-8859-' + x for x in '12459'] defines a character for *all* 256
possible byte values, so none of them can ever fail to decode.
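
To make the point above concrete, here is a minimal sketch (written for Python 3, which postdates this post; the variable names are only illustrative). It decodes every possible byte value with each of the five codecs and shows that all succeed while disagreeing on what the bytes mean:

```python
# Every possible byte value, 0x00 through 0xFF.
data = bytes(range(256))

# The five encodings named in the text.
codec_names = ['iso-8859-' + x for x in '12459']

# None of these calls raises UnicodeDecodeError.
decoded = {name: data.decode(name) for name in codec_names}

# Every codec "succeeds", yet a single byte means different things:
sample = b'\xe9'
as_latin1 = sample.decode('iso-8859-1')    # 'é' (U+00E9)
as_cyrillic = sample.decode('iso-8859-5')  # 'щ' (U+0449)

for name in codec_names:
    print(name, repr(sample.decode(name)))
```

So "no exception" cannot serve as the definition of success here.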

There are various language-guessing algorithms based on e.g. frequency 
of ngrams ... try Google.
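
As a rough illustration of the ngram idea (a toy sketch only, not any particular published algorithm): score each candidate decoding by how many of its letter bigrams also occur in a reference sample of the expected language. The reference sentence, the candidate list and the helper names `bigrams` and `score` are all assumptions made for this example; a real guesser would be trained on a large corpus.

```python
from collections import Counter

def bigrams(text):
    """Count adjacent pairs of alphabetic characters, lowercased."""
    text = text.lower()
    return Counter(a + b for a, b in zip(text, text[1:])
                   if a.isalpha() and b.isalpha())

def score(candidate, reference_bigrams):
    """Fraction of the candidate's bigrams that also occur in the reference."""
    counts = bigrams(candidate)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    known = sum(n for bg, n in counts.items() if bg in reference_bigrams)
    return known / total

# Hypothetical reference sample for the language we expect (French here).
reference = "le café est très près de l'école où l'élève a déjà été reçu à midi"
reference_bigrams = bigrams(reference)

# Bytes of "unknown" encoding (actually iso-8859-1 in this demo).
raw = "l'élève est déjà près du café".encode('iso-8859-1')

candidates = ['iso-8859-1', 'iso-8859-2', 'iso-8859-5']
scores = {name: score(raw.decode(name), reference_bigrams)
          for name in candidates}
best = max(scores, key=scores.get)
print(best, scores)
```

Note that encodings which agree on every byte present in the data (iso-8859-1 and iso-8859-9 would, for this sample) score identically, so even this approach cannot always separate them.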

Possibility 2:
You "know" the file is in a Unicode encoding, e.g. utf-8, have
successfully decoded it to Unicode, and are going to try to encode the
file in a "legacy" encoding, but you don't know which one is appropriate?
Sorry, same "But": several legacy codecs may encode the text without an
exception being raised.
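
The same ambiguity in the encoding direction can be sketched like this (Python 3 again; the sample text and the names `text`, `legacy` and `accepted` are chosen only for illustration). It collects every legacy codec that encodes the text without raising:

```python
text = 'caf\u00e9'  # 'café', already decoded to Unicode

legacy = ['iso-8859-' + x for x in '12459']

accepted = []
for name in legacy:
    try:
        text.encode(name)
    except UnicodeEncodeError:
        # This codec has no byte for at least one character.
        continue
    accepted.append(name)

print(accepted)
# → ['iso-8859-1', 'iso-8859-2', 'iso-8859-4', 'iso-8859-9']
```

Only the Cyrillic codec is ruled out; four candidates remain, and nothing in the text says which one the recipient expects.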





