Puzzled by code pages

Sat May 15 15:31:26 EDT 2010

"Adam Tauno Williams" <awilliam at whitemice.org> wrote in message 
news:1273932760.3929.18.camel at linux-yu4c.site...
> On Sat, 2010-05-15 at 20:30 +1000, Lie Ryan wrote:
>> On 05/15/10 10:27, Adam Tauno Williams wrote:
 [snip]

> Yep.  But in the interpreter both unicode() and repr() produce the same
> output.  Nothing displays the accented character.
>
> h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
> data = h.read()
> h.close()
> str(data)

Here you are correctly reading an iso8859-2-encoded file and converting it 
to Unicode.

Try "print data".  "str(data)" converts a Unicode string to a byte string, 
but it only uses the default encoding, which is 'ascii', so it raises 
UnicodeEncodeError on any non-ASCII character.  print uses the stdout 
encoding of your terminal, if known.  Try these commands on your system 
(mine is Windows XP):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp437'

You should only attempt to "print" Unicode strings or byte strings encoded 
in the stdout encoding.  Printing byte strings in any other encoding will 
often print garbage.
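
As a rough sketch of why the default 'ascii' encoding fails where a terminal 
encoding can succeed, the snippet below encodes one Unicode string with three 
codecs.  The sample text and the helper try_encode are made up for 
illustration, and 'cp437' simply stands in for whatever sys.stdout.encoding 
reports on your machine.

```python
text = u"caf\u00e9"  # sample Unicode string ending in e-acute (U+00E9)

def try_encode(s, encoding):
    """Return s encoded with the given codec, or None if it cannot be encoded."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError:
        return None

# 'ascii' has no e-acute, so this is None (str() in Python 2 hits the same error).
ascii_bytes = try_encode(text, "ascii")

# The same character is a different single byte in each of these code pages.
cp437_bytes = try_encode(text, "cp437")
latin2_bytes = try_encode(text, "iso8859-2")
```

The two successful results differ in their last byte, which is why bytes 
encoded for one code page tend to print as garbage on a terminal that 
expects another.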

[snip]
> I think I'm getting close.  Parsing the file seems to work, and while
> writing it out does not error, rereading my own output fails. :(
> Possibly I'm 'accidentally' writing the output as UTF-8 and not
> ISO8859-2.  I need the internal data to be UTF-8 but read as ISO8859-2
> and rewritten back to ISO8859-2 [at least that is what I believe from
> the OpenStep files I'm seeing].

The "internal data" is Unicode, not UTF-8.  A Unicode string is a sequence 
of code points with no byte encoding attached (Python uses UTF-16 or UTF-32 
internally, but that is an implementation detail).  UTF-8 is a byte encoding 
of those code points.
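
To make the distinction concrete, here is a small sketch with a made-up 
sample string: one Unicode string, two different byte encodings of it.  The 
variable names are only for illustration.

```python
# Four code points: l-with-stroke, o-acute, d, z-acute (three are non-ASCII).
s = u"\u0142\u00f3d\u017a"

as_latin2 = s.encode("iso8859-2")  # one byte per character in this code page
as_utf8 = s.encode("utf-8")        # two bytes for each non-ASCII character

code_points = len(s)
latin2_length = len(as_latin2)
utf8_length = len(as_utf8)

# Decoding each byte string with its own codec gives back the same Unicode string.
same_text = as_latin2.decode("iso8859-2") == as_utf8.decode("utf-8") == s
```

The Unicode string is the same in both cases; only the bytes differ, and 
they only mean something once you know which encoding produced them.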

If you actually need the internal data as UTF-8 (maybe you are working with 
a library that works with UTF-8 strings), then:

>>> f = codecs.open("in.txt", 'rb', encoding="iso8859-2")
>>> s = f.read()  # s is a Unicode string.
>>> s = s.encode('utf-8') # now s is a UTF-8 byte string
>>> f.close()

(process data as UTF-8 here).

>>> s = s.decode('utf-8') # s is Unicode again.
>>> f2 = codecs.open("out.txt", 'wb', encoding="iso8859-2")
>>> f2.write(s)
>>> f2.close()

Note you *decode* byte strings to Unicode and *encode* Unicode into byte 
strings.
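
If you want to check the round trip end to end, something like the following 
sketch may help.  It is not the code from the session above: it uses io.open 
(available in Python 2.6+ and 3) in place of codecs.open, writes its own 
sample input to a temporary directory, and all file and variable names are 
assumptions for illustration.

```python
import io
import os
import shutil
import tempfile

# Sample input: ISO 8859-2 bytes containing non-ASCII characters.
original = u"\u0142\u00f3d\u017a\n".encode("iso8859-2")

tmpdir = tempfile.mkdtemp()
in_path = os.path.join(tmpdir, "in.txt")
out_path = os.path.join(tmpdir, "out.txt")

with io.open(in_path, "wb") as f:
    f.write(original)

# Decode on read: ISO 8859-2 bytes -> Unicode (newline="" avoids translation).
with io.open(in_path, "r", encoding="iso8859-2", newline="") as f:
    text = f.read()

utf8 = text.encode("utf-8")        # Unicode -> UTF-8 bytes for processing
text_again = utf8.decode("utf-8")  # UTF-8 bytes -> Unicode again

# Encode on write: Unicode -> ISO 8859-2 bytes.
with io.open(out_path, "w", encoding="iso8859-2", newline="") as f:
    f.write(text_again)

with io.open(out_path, "rb") as f:
    written = f.read()

shutil.rmtree(tmpdir)

round_trip_ok = (written == original)  # output matches the input byte for byte
accidental_utf8 = (utf8 != original)   # writing utf8 directly would have differed
```

If round_trip_ok is False on your real files, comparing the raw bytes of 
input and output is a quick way to see whether UTF-8 was written where 
ISO 8859-2 was intended.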

-Mark