Puzzled by code pages

Adam Tauno Williams awilliam at whitemice.org
Sat May 15 10:12:40 EDT 2010


On Sat, 2010-05-15 at 20:30 +1000, Lie Ryan wrote:
> On 05/15/10 10:27, Adam Tauno Williams wrote:
> > I'm trying to process OpenStep plist files in Python.  I have a parser
> > which works, but only for strict ASCII.  However plist files may contain
> > accented characters - equivalent to ISO-8859-2 (I believe).  For example
> > I read in the line:
> >>>> handle = open('file.txt', 'rb')
> >>>> data = handle.read()
> >>>> handle.close()
> >>>> data
> > '    "skyp4_filelist_10201/localit\xc3\xa0 termali_sortfield" =
> > NSFileName;\n'
> I presume you're using Python 2.x.

Yes.  But the day of all-unicode strings will be wonderful when it
comes. :)

> > What is the correct way to re-encode this data into UTF-8 so I can use
> > unicode strings, and then write the output back to ISO8859-?
> > I can read the file using codecs as ISO8859-2, but it still doesn't seem
> > correct.
> >>>> handle = codecs.open('file.txt', 'rb', encoding='iso8859-2')
> >>>> data = handle.read()
> >>>> handle.close()
> >>>> data
> > u'    "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
> > NSFileName;\n'
> When printing in the interactive interpreter, python uses __repr__
> representation by default. If you want to use __str__ representation use
> "print data" (note, your terminal must support printing unicode
> characters); 

Using GNOME Terminal, so Unicode characters should display correctly.
And I do see the characters when I 'cat' the file.

> either way, even though the string looks like '\u0102' when
> printed on the terminal, the binary pattern in memory should
> correctly represent the accented character.

Yep.  But in the interpreter both unicode() and repr() produce the same
output.  Nothing displays the accented character.

h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
data = h.read()
h.close()
str(data)

UnicodeEncodeError: 'ascii' codec can't encode characters in position
33-34: ordinal not in range(128)

unicode(data)
u'    "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
NSFileName;\n'

repr(data)
'u\'    "skyp4_filelist_10201/localit\\u0102\\xa0 termali_sortfield" =
NSFileName;\\n\''
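
A possible explanation for the odd characters, offered as a sketch rather
than a diagnosis: the byte pair \xc3\xa0 in the raw read is what UTF-8 uses
for U+00E0 (a with grave), while ISO-8859-2 reads the same two bytes as
U+0102 followed by U+00A0, which matches the output above.  The snippet
assumes only the bytes shown in the first session; the variable names are
made up for illustration.

```python
# Raw bytes as they appeared in the first interpreter session
# (shortened to the interesting part of the line).
raw = b'localit\xc3\xa0 termali'

# ISO-8859-2 is a single-byte code page, so every byte becomes one character:
# 0xC3 -> U+0102 (A with breve), 0xA0 -> U+00A0 (no-break space).
as_latin2 = raw.decode('iso8859-2')

# UTF-8 treats 0xC3 0xA0 as one two-byte sequence: U+00E0 (a with grave).
as_utf8 = raw.decode('utf-8')

print(repr(as_latin2))
print(repr(as_utf8))
```

If the second result is the intended text, the file may be UTF-8 rather
than ISO-8859-2.  Note that ISO-8859-2 has no code for a-grave (U+00E0), so
that character could not be written back to an ISO-8859-2 file in any case.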

I think I'm getting close.  Parsing the file seems to work, and while
writing it out does not error, rereading my own output fails. :(
Possibly I'm 'accidentally' writing the output as UTF-8 and not
ISO8859-2.  I need the internal data to be UTF-8 but read as ISO8859-2
and rewritten back to ISO8859-2 [at least that is what I believe from
the OpenStep files I'm seeing].

What is the 'official' way to encode something from UTF-8 to another
code page?  I *assumed* that if I wrote a unicode stream back through:

h = codecs.open(output_filename, 'wb', encoding='iso8859-2')
data = writer.store(defaults)
h.write(data)
h.close()

that it would be re-encoded [word?].  But maybe not?
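
One way to check whether codecs.open really re-encodes on write is to write
a known unicode string and read the raw bytes back.  The sketch below
assumes the unicode string from the session above and a throwaway temporary
file; it does not call writer.store(), whose return type is not shown here.

```python
import codecs
import os
import tempfile

# The unicode string obtained earlier by decoding the file as ISO-8859-2.
text = u'localit\u0102\xa0 termali'

# Hypothetical scratch file, used only for this demonstration.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Writing unicode through codecs.open encodes it with the given codec.
    h = codecs.open(path, 'wb', encoding='iso8859-2')
    h.write(text)
    h.close()

    # Read the raw bytes back without any decoding.
    with open(path, 'rb') as raw_handle:
        written = raw_handle.read()
finally:
    os.remove(path)

# Compare what landed on disk with explicit encodings of the same string.
latin2_bytes = text.encode('iso8859-2')
utf8_bytes = text.encode('utf-8')

print(repr(written))
print(written == latin2_bytes)
print(written == utf8_bytes)
```

If the bytes on disk match the ISO-8859-2 encoding, the write side is doing
what was assumed.  Should writer.store() return a byte string instead of a
unicode object, Python 2 would try to decode it as ASCII before encoding, so
checking type(data) before the write may be worthwhile.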



> f = codecs.open("in.txt", 'rb', encoding="iso8859-2")
> f2 = codecs.open("out.txt", 'wb', encoding="utf-8")
> s = f.read()
> f2.write(s)
> f.close()
> f2.close()

-- 
Adam Tauno Williams <awilliam at whitemice.org> LPIC-1, Novell CLA
<http://www.whitemiceconsulting.com>
OpenGroupware, Cyrus IMAPd, Postfix, OpenLDAP, Samba
-------------- next part --------------
    "skyp4_filelist_10201/località termali_sortfield" = NSFileName;

