Distinguishing cp850 and cp1252?

John Roth newsgroups at jhrothjr.com
Sun Nov 2 21:35:26 EST 2003


"David Eppstein" <eppstein at ics.uci.edu> wrote in message
news:eppstein-FD3246.17361302112003 at news.service.uci.edu...
> I'm working on some Python code for reading files in a certain format,
> and the examples of such files I've found on the internet appear to be
> in either cp850 or cp1252 encoding (with one exception, for which I
> can't find a correct encoding among the standard Python ones).
>
> The file format itself includes nothing about which encoding is used,
> but only one of the two produces sensible results in the non-ascii
> examples I've seen.
>
> Is there an easy way of guessing with reasonable accuracy which of these
> two encodings was used for a particular file?

The only way I know of is to do a statistical analysis of character
frequencies. To do that, you have to know your data fairly well.
For example, CP850 devotes a number of code points to box-drawing
characters. If your data doesn't involve drawing boxes, and decoding
the input as CP850 produces those characters, I'd say that's a strong
clue that you're actually dealing with CP1252.
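
A rough sketch of that check in Python 3 follows. The names
guess_encoding and CP850_BOX_BYTES are made up for illustration, and
the rule "any box-drawing byte means CP1252" assumes the data never
legitimately contains boxes. The first test is an extra clue not
mentioned above: Python's cp1252 codec leaves bytes 0x81, 0x8D, 0x8F,
0x90 and 0x9D undefined, so a file containing any of them cannot be
CP1252.

```python
def _cp850_box_bytes():
    """Collect the byte values that cp850 decodes to box-drawing or block characters."""
    found = set()
    for value in range(0x80, 0x100):
        char = bytes([value]).decode("cp850")
        # U+2500..U+259F covers the Unicode Box Drawing and Block Elements ranges.
        if 0x2500 <= ord(char) <= 0x259F:
            found.add(value)
    return frozenset(found)


CP850_BOX_BYTES = _cp850_box_bytes()


def guess_encoding(data):
    """Return 'cp850', 'cp1252', or None when the bytes give no evidence.

    Hypothetical helper: a heuristic sketch, not a reliable detector.
    """
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        # The data contains a byte that cp1252 leaves undefined.
        return "cp850"
    # Assumption: the data never contains real box drawings, so any byte
    # that cp850 would render as a box is taken as a sign of cp1252.
    if any(byte in CP850_BOX_BYTES for byte in data):
        return "cp1252"
    return None
```

A result of None means the sketch found nothing either way. For
instance, a CP1252 file whose only non-ASCII byte is 0xE9 (e-acute)
gives no box-drawing evidence, because CP850 decodes that byte to a
letter as well. That is where the frequency analysis comes in.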

I know this doesn't help all that much, but it's the only thing
that has worked for me.

John Roth
>
> -- 
> David Eppstein                      http://www.ics.uci.edu/~eppstein/
> Univ. of California, Irvine, School of Information & Computer Science





