unknown encoding problem

Fri Apr 8 10:50:19 EDT 2005

Uwe Mayer wrote:

> I need to read in a text file which seems to be stored in some unknown
> encoding. Opening and reading the files content returns:
> 
>>>> f.read()
> '\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00...
> 
> Each character has a \x00 prepended to it. I suspect its some kind of
> unicode - how do I get rid of it?

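Interleaved '\x00' bytes are indeed strong evidence of Unicode, most
likely UTF-16. Here the zero byte comes before each character, which
points to the big-endian variant. Use codecs.open() to access the data
in such a file: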

>>> import codecs
>>> f = codecs.open(filename, "r", "UTF-16-BE")
>>> f.read()
u'  <logEntry'
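
If you are not sure about the byte order, the position of the zero bytes
can be checked before opening the file. The following is only a rough
sketch, written in Python 3 compatible syntax: guess_utf16_codec is a
hypothetical helper (not part of the codecs module), and the zero-byte
count is a heuristic that assumes mostly ASCII text.

```python
import codecs

def guess_utf16_codec(raw):
    """Guess a UTF-16 codec name for a byte string (hypothetical helper).

    Heuristic only: trust a byte-order mark if one is present, otherwise
    assume mostly-ASCII text and look at where the zero bytes sit.
    Returns None when the data gives no hint either way.
    """
    if raw.startswith(codecs.BOM_UTF16_BE) or raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"  # the generic codec reads and strips the BOM itself
    even_zeros = raw[0::2].count(b"\x00")  # zero bytes at offsets 0, 2, 4, ...
    odd_zeros = raw[1::2].count(b"\x00")   # zero bytes at offsets 1, 3, 5, ...
    if even_zeros > odd_zeros:
        return "utf-16-be"  # high (zero) byte first
    if odd_zeros > even_zeros:
        return "utf-16-le"  # low byte first
    return None

# Sample bytes that mirror the output shown in the question.
raw = b"\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y"
codec = guess_utf16_codec(raw)
text = raw.decode(codec)
```

The returned name can be passed to codecs.open() in place of the
hard-coded "UTF-16-BE".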

If you don't want unicode, convert back to str:

>>> _.encode("latin1")
'  <logEntry'

Note that the last step raises UnicodeEncodeError if the file contains
characters not available in the target encoding you specify.
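
A small sketch of what that failure looks like and two ways around it,
again in Python 3 compatible syntax. The sample string with a euro sign
is made up for illustration; the "replace" and "xmlcharrefreplace"
error handlers are standard, and the latter may suit XML data like
this, but whether either is acceptable depends on what you do with the
result.

```python
# Hypothetical sample: the euro sign has no Latin-1 equivalent.
text = u"  <logEntry \u20ac"

failed_on = None
try:
    text.encode("latin1")
except UnicodeEncodeError as exc:
    # exc.object is the input string; start/end delimit the bad span.
    failed_on = exc.object[exc.start:exc.end]

# Substitute "?" for anything that cannot be encoded (lossy).
replaced = text.encode("latin1", "replace")

# Substitute an XML character reference instead (keeps the information).
escaped = text.encode("latin1", "xmlcharrefreplace")
```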

Peter