Puzzled by code pages

Sat May 15 12:20:58 EDT 2010

On 05/16/10 00:12, Adam Tauno Williams wrote:
> On Sat, 2010-05-15 at 20:30 +1000, Lie Ryan wrote:
>> On 05/15/10 10:27, Adam Tauno Williams wrote:
>>> I'm trying to process OpenStep plist files in Python.  I have a parser
>>> which works, but only for strict ASCII.  However plist files may contain
>>> accented characters - equivalent to ISO-8859-2 (I believe).  For example
>>> I read in the line:
>>>>>> handle = open('file.txt', 'rb')
>>>>>> data = handle.read()
>>>>>> handle.close()
>>>>>> data
>>> '    "skyp4_filelist_10201/localit\xc3\xa0 termali_sortfield" =
>>> NSFileName;\n'
>> I presume you're using Python 2.x.
> 
> Yes.  But the days of all-unicode-strings will be wonderful when it
> comes. :)
> 
>>> What is the correct way to re-encode this data into UTF-8 so I can use
>>> unicode strings, and then write the output back to ISO8859-?
>>> I can read the file using codecs as ISO8859-2, but it still doesn't seem
>>> correct.
>>>>>> handle = codecs.open('file.txt', 'rb', encoding='iso8859-2')
>>>>>> data = handle.read()
>>>>>> handle.close()
>>>>>> data
>>> u'    "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
>>> NSFileName;\n'
>> When printing in the interactive interpreter, python uses __repr__
>> representation by default. If you want to use __str__ representation use
>> "print data" (note, your terminal must support printing unicode
>> characters); 
> 
> Using GNOME Terminal, so Unicode characters should display correctly.
> And I do see the characters when I 'cat' the file.

'cat' works because it deals only in bytes and never tries to interpret
the stream it is writing. You can get the same effect in Python by
writing a byte string ('str') instead of a 'unicode' object.
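
The byte-for-byte behaviour of 'cat' can be sketched roughly like this.
copy_bytes is a made-up helper name, and the io.BytesIO objects are only
stand-ins for the real file and the terminal's byte stream:

```python
import io

def copy_bytes(src, dst, chunk_size=4096):
    """Copy raw bytes from src to dst without ever decoding them."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total

# Hypothetical stand-ins for the input file and the terminal.
source = io.BytesIO(b'localit\xc3\xa0 termali\n')
sink = io.BytesIO()
copied = copy_bytes(source, sink)
```

Because nothing is decoded, no codec is involved and nothing can fail;
whether the result looks right depends entirely on the terminal.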

> h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
> data = h.read()
> h.close()
> str(data)
> 
> 'ascii' codec can't encode characters in position 33-34: ordinal not in
> range(128)

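str(data) encodes the 'unicode' object with Python's default encoding
(normally ascii), whatever the terminal is, so it fails on any non-ASCII
character. If "print data" fails the same way, then either your terminal
can't print unicode or Python for some reason thinks that the terminal
is an ASCII terminal. You can encode the string manually, e.g.: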

print u'\u0102\xa0'.encode('utf-8')

Or you can figure out how to set up your terminal so that Python
recognizes it as a UTF-8 terminal; see
http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/

When Python prints a 'unicode' object, it first needs to encode that
object into a 'str'; by default it uses sys.stdout.encoding to decide
which encoding to use.
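
A minimal sketch of that lookup, assuming you would rather fall back to
UTF-8 than fail when the stream reports no encoding. encode_for_stream
and FakeTerminal are made-up names for illustration, and unlike print
this version replaces unencodable characters instead of raising:

```python
import io

def encode_for_stream(text, stream, fallback='utf-8'):
    """Encode text with the stream's declared encoding, or a fallback.

    Characters the chosen encoding cannot represent become '?'.
    """
    encoding = getattr(stream, 'encoding', None) or fallback
    return text.encode(encoding, 'replace')

class FakeTerminal(object):
    """Hypothetical stream that claims to be an ASCII terminal."""
    encoding = 'ascii'

ascii_bytes = encode_for_stream(u'\u0102\xa0', FakeTerminal())
# io.BytesIO has no 'encoding' attribute, so the fallback is used.
utf8_bytes = encode_for_stream(u'\u0102\xa0', io.BytesIO())
```

In real use you would pass sys.stdout as the stream.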

> unicode(data)
> u'    "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
> NSFileName;\n'

If data is already a 'unicode', this is not surprising: 'unicode(data)'
simply returns 'data' unchanged.

> I think I'm getting close.  Parsing the file seems to work, and while
> writing it out does not error, rereading my own output fails. :(
> Possibly I'm 'accidentally' writing the output as UTF-8 and not
> ISO8859-2.  I need the internal data to be UTF-8 but read as ISO8859-2
> and rewritten back to ISO8859-2 [at least that is what I believe from
> the OpenStep files I'm seeing].

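A unicode string doesn't have an encoding (well, Python needs some
encoding to store the unicode data in RAM, but that's an implementation
detail). A unicode string is not a stream of bytes encoded in a specific
way; it's an encoding-agnostic block of text. So there is no "internal
UTF-8" to aim for: decode the bytes once when you read them, and encode
once when you write them.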
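
To make the bytes/text distinction concrete, here is a small sketch using
the two non-ASCII bytes from the data quoted earlier in the thread. It
also suggests, though this is only my reading of that one sample, that
the file may really be UTF-8 rather than ISO-8859-2:

```python
# The bytes as they appear in the raw file contents quoted above.
raw = b'localit\xc3\xa0'

# The same bytes give different text depending on the codec used.
as_utf8 = raw.decode('utf-8')         # one character: a-grave (U+00E0)
as_latin2 = raw.decode('iso8859-2')   # two characters: U+0102, U+00A0

# Encoding with the codec that was used for decoding restores the bytes.
roundtrip_utf8 = as_utf8.encode('utf-8')
roundtrip_latin2 = as_latin2.encode('iso8859-2')
```

The ISO-8859-2 result is exactly the u'...\u0102\xa0' seen in the
codecs.open() output, which is what a wrong codec guess looks like.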

> What is the 'official' way to encode something from UTF-8 to another
> code page.  I *assumed* that if I wrote a unicode stream back through:
> 
> h = codecs.open(output_filename, 'wb', encoding='iso8859-2')
> data = writer.store(defaults)
> h.write(data)
> h.close()

What's "writer.store(defaults)"? It should return a 'unicode' if you
want h.write() to work properly. Otherwise, if data is a 'str', h.write()
will first try to decode the 'str' to 'unicode' using the default codec
(usually ascii), which fails on any non-ASCII byte, and only then encode
that 'unicode' to 'iso8859-2'.
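
One way to sidestep the implicit ascii decode is to convert to text
explicitly before writing. This is only a sketch: to_text and
write_latin2 are made-up helpers, it uses io.open() in place of
codecs.open(), and it assumes any byte string handed in is UTF-8. The
sample text is invented and uses a character that ISO-8859-2 can
represent:

```python
import io
import os
import tempfile

def to_text(value, encoding='utf-8'):
    """Return value as text, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)  # assumption: bytes are UTF-8
    return value

def write_latin2(path, value):
    """Write value to path encoded as ISO-8859-2."""
    with io.open(path, 'w', encoding='iso8859-2', newline='') as handle:
        handle.write(to_text(value))

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'out.plist')

# UTF-8 bytes in (n-acute is \xc5\x84), ISO-8859-2 bytes out.
write_latin2(path, b'"Gda\xc5\x84sk" = NSFileName;\n')
with io.open(path, 'rb') as handle:
    written = handle.read()

# Remove the temporary file and directory again.
os.remove(path)
os.rmdir(tmp_dir)
```

Reading the file back in binary mode is a quick way to check which
encoding actually ended up on disk.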