the unicode saga continues...

Sat Nov 14 02:32:07 EST 2009

Ethan Furman wrote:
> Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
> (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
>  >>> print u'\xed'
> í
>  >>> print u'\xed'.encode('cp437')
> í
>  >>> print u'\xed'.encode('cp850')
> í
>  >>> print u'\xed'.encode('cp1252')
> φ
>  >>> import locale
>  >>> locale.getdefaultlocale()
> ('en_US', 'cp1252')
> 
> My confusion lies in my apparant codepage (cp1252), and the discrepancy
> with character u'\xed' which is absolutely an i with an accent; yet when
> I encode with cp1252 and print it, I get an o with a line.
                                     ^^^^^^^^^^^^^^^^^^^^^^
For the record: I read a small Greek letter phi in your posting, not an o 
with a line. If I encode according to my default locale (UTF-8), I get the 
letter i with the accent. If I encode with codepage 1252, I get a marker for 
an invalid character on my terminal. This is using Debian though, not MS 
Windows.

Try printing the repr() of that. The point is that internally, you have the 
code point U+00ED (u'\xed'). Encoding it with the various codepages yields a 
string of bytes representing that character ('\xa1', '\xa1' and '\xed' for 
cp437, cp850 and cp1252 respectively), which is what you need for storing it 
on disk in a file that uses said codepage, or for other forms of IO.
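
To make the byte values above visible, here is a minimal sketch (the names 
'text' and 'encoded' are just for illustration). It prints the repr() of each 
encoded result, so the terminal's own codepage does not get in the way:

```python
text = u'\xed'  # U+00ED, LATIN SMALL LETTER I WITH ACUTE

encoded = {}
for codec in ('cp437', 'cp850', 'cp1252'):
    # encode() turns the Unicode string into a byte string for that codepage
    encoded[codec] = text.encode(codec)
    # repr() shows the raw byte value, not the glyph the terminal would pick
    print("%-7s %r" % (codec, encoded[codec]))
```

The same character ends up as byte 0xA1 in cp437 and cp850, but as byte 0xED 
in cp1252.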

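Now, with a Unicode string, the output (print) knows what to do: it encodes 
it according to the encoding of the output stream (sys.stdout.encoding, which 
normally comes from the terminal or locale settings) and sends the resulting 
bytes to stdout. With a byte string, I think it forwards the content to 
stdout unchanged, so the terminal interprets the bytes in its own codepage.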
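
As a sketch of what might be happening, assume (this is my guess, nothing in 
the post confirms it) that the console itself displays bytes as cp437. The 
cp1252 byte for the accented i, read back through cp437, then comes out as a 
different character:

```python
# Byte string as produced by encoding with cp1252 (a single byte, 0xED)
raw = u'\xed'.encode('cp1252')

seen = {}
# Hypothetical console codepages; which one applies is an assumption
for console_codec in ('cp437', 'cp1252'):
    # decode() mimics how a console using that codepage would read the byte
    seen[console_codec] = raw.decode(console_codec)
    print("%-7s U+%04X" % (console_codec, ord(seen[console_codec])))
```

Under cp437 the byte 0xED maps to U+03C6, the small Greek letter phi, which 
matches what I read in the posting.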

Note:
 * If you want to verify your code, use 'print repr(..)' instead of a plain 
   print.
 * I could imagine that your locale is simply not set up correctly.
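
A quick way to compare what Python believes about the output stream and the 
locale, as a rough sketch (the variable names are mine; the values depend 
entirely on your system, so no expected output is given):

```python
import locale
import sys

# Encoding print uses for Unicode strings; may be None if stdout is redirected
stdout_encoding = getattr(sys.stdout, 'encoding', None)
# Encoding the locale machinery considers the user's preferred one
preferred_encoding = locale.getpreferredencoding()

print("stdout encoding:    %r" % (stdout_encoding,))
print("preferred encoding: %r" % (preferred_encoding,))
```

If the two differ, bytes encoded for one will not necessarily display 
correctly on the other.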

Uli