Puzzled by code pages

Sat May 15 06:30:31 EDT 2010

On 05/15/10 10:27, Adam Tauno Williams wrote:
> I'm trying to process OpenStep plist files in Python.  I have a parser
> which works, but only for strict ASCII.  However plist files may contain
> accented characters - equivalent to ISO-8859-2 (I believe).  For example
> I read in the line:
> 
>>>> handle = open('file.txt', 'rb')
>>>> data = handle.read()
>>>> handle.close()
>>>> data
> '    "skyp4_filelist_10201/localit\xc3\xa0 termali_sortfield" =
> NSFileName;\n'

I presume you're using Python 2.x.

> What is the correct way to re-encode this data into UTF-8 so I can use
> unicode strings, and then write the output back to ISO8859-?
> 
> I can read the file using codecs as ISO8859-2, but it still doesn't seem
> correct.
> 
>>>> handle = codecs.open('file.txt', 'rb', encoding='iso8859-2')
>>>> data = handle.read()
>>>> handle.close()
>>>> data
> u'    "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
> NSFileName;\n'

In the interactive interpreter, Python displays a value using its
repr() by default, which escapes non-ASCII characters. To see the str()
form, use "print data" (note that your terminal must support printing
Unicode characters). Either way, an escape such as '\u0102' is only how
the string is displayed; the unicode object in memory holds the decoded
characters themselves, and those are the right ones provided the
encoding you passed matches the file.
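
To look past the repr and see which characters were actually decoded,
one option is to ask unicodedata for their names. The following is only
a sketch: the byte string is copied from the line quoted above, and the
helper name "names" is my own invention. It decodes the same bytes once
as ISO-8859-2 and once as UTF-8 for comparison.

```python
import unicodedata

# Bytes copied from the quoted example line (shortened to the relevant part).
raw = b'localit\xc3\xa0 termali'

as_latin2 = raw.decode('iso8859-2')   # what codecs.open(..., 'iso8859-2') produced
as_utf8 = raw.decode('utf-8')         # alternative guess at the file's encoding

def names(text):
    # Hypothetical helper: name every non-ASCII character, so the result
    # does not depend on what the terminal is able to display.
    return [unicodedata.name(ch) for ch in text if ord(ch) > 127]

latin2_names = names(as_latin2)
utf8_names = names(as_utf8)
print(latin2_names)  # ['LATIN CAPITAL LETTER A WITH BREVE', 'NO-BREAK SPACE']
print(utf8_names)    # ['LATIN SMALL LETTER A WITH GRAVE']
```

Decoded as UTF-8, the two bytes \xc3\xa0 become a single "a with grave",
which fits the Italian word in the key. That suggests this file may be
UTF-8 rather than ISO-8859-2, but it is an inference from one line, not
something I can confirm for the whole file.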

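import codecs

# Decode in.txt as ISO-8859-2 and re-encode it as UTF-8 in out.txt.
with codecs.open("in.txt", 'rb', encoding="iso8859-2") as f:
    s = f.read()    # s is a unicode string
with codecs.open("out.txt", 'wb', encoding="utf-8") as f2:
    f2.write(s)     # encoded to UTF-8 on the way out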
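
The same read-then-write pattern works in either direction, so it can
also cover the "write the output back" half of the question. Below is a
minimal sketch under some assumptions: the function name "transcode" and
its parameters are hypothetical, and I use io.open instead of
codecs.open only because it behaves the same on Python 2.6+ and 3.x.
The encodings are whatever you determine the files really use.

```python
import io

def transcode(src_path, dst_path, src_encoding, dst_encoding):
    """Decode src_path using src_encoding and re-encode it into dst_path."""
    # newline="" keeps the original line endings untouched in both directions.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()    # unicode text, independent of any byte encoding
    with io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        # Raises UnicodeEncodeError if a character has no mapping
        # in dst_encoding.
        dst.write(text)
    return text
```

For example, transcode("in.txt", "out.txt", "utf-8", "iso8859-1") would
turn the two bytes \xc3\xa0 into the single byte \xe0. Note that
"a with grave" has no code point in ISO-8859-2, so asking for that target
encoding raises UnicodeEncodeError on this particular line; passing an
errors= argument to the second io.open call (for instance "replace") is
one way to handle such characters, at the cost of losing them.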