unknown encoding problem
Peter Otten
__peter__ at web.de
Fri Apr 8 10:50:19 EDT 2005
Uwe Mayer wrote:
> I need to read in a text file which seems to be stored in some unknown
> encoding. Opening and reading the file's content returns:
>
> >>> f.read()
> '\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00...
>
> Each character has a \x00 prepended to it. I suspect it's some kind of
> unicode - how do I get rid of it?
Interleaved '\x00' bytes are indeed strong evidence for unicode, most
likely UTF-16. Since the zero byte comes before each character, the byte
order is big-endian. Use codecs.open() to access the data in such a file:
>>> import codecs
>>> f = codecs.open(filename, "r", "UTF-16-BE")
>>> f.read()
u' <logEntry'
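To see the effect without the original file, here is a small self-contained sketch. It assumes the data really is UTF-16-BE without a byte order mark; the sample text and the temporary file name are made up for illustration.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for the file in question: a short string written
# as UTF-16-BE without a byte order mark.
sample_text = u" <logEntry"
sample_path = os.path.join(tempfile.mkdtemp(), "sample.log")
with open(sample_path, "wb") as f:
    f.write(sample_text.encode("utf-16-be"))

# Reading in binary mode shows the interleaved zero bytes.
with open(sample_path, "rb") as f:
    raw = f.read()

# Reading through codecs.open() decodes them away.
with codecs.open(sample_path, "r", "utf-16-be") as f:
    decoded = f.read()
```

Here raw holds two bytes per character, while decoded is the plain unicode text.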
If you don't want unicode, convert back to str:
>>> _.encode("latin1")
' <logEntry'
Note that the last step raises a UnicodeEncodeError if the file contains
characters not available in the string encoding you specify.
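A minimal sketch of that failure and two possible ways around it, assuming lossy output is acceptable. The sample text with a euro sign is hypothetical; "replace" and "xmlcharrefreplace" are standard error handlers for encode().

```python
# Hypothetical content containing a character that latin1 cannot represent.
text = u"<logEntry>\u20ac</logEntry>"

try:
    text.encode("latin1")
    failed = False
except UnicodeEncodeError:
    failed = True

# Substitute a question mark for every unencodable character.
replaced = text.encode("latin1", "replace")

# Or substitute an XML character reference, which may suit XML-like data.
escaped = text.encode("latin1", "xmlcharrefreplace")
```

Whether either substitution is appropriate depends on what consumes the result; keeping the data as unicode avoids the problem entirely.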
Peter