Unicode

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Mar 15 06:58:19 EDT 2013


On Fri, 15 Mar 2013 11:46:36 +0100, Thomas Heller wrote:

> I thought I understood Unicode (somewhat, at least), but this seems not
> to be the case.
> 
> I expected the following code to print 'µm' two times to the console:
> 
> <code>
> # -*- coding: cp850 -*-
> 
> a = u"µm"
> b = u"\u03bcm"
> 
> print(a)
> print(b)
> </code>
> 
> But what I get is this:
> 
> <output>
> µm
> Traceback (most recent call last):
>    File "x.py", line 7, in <module>
>      print(b)
>    File "C:\Python33-64\lib\encodings\cp850.py", line 19, in encode
>      return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03bc' in
> position 0: character maps to <undefined>
> </output>
> 
> Using (German) Windows, command prompt, codepage 850.
> 
> The same happens with Python 2.7.  What am I doing wrong?


That's because the two strings are not the same, even though they look 
alike on screen.

You can isolate the error by noting that the second one only raises an 
exception when you try to print it. That suggests that the problem is 
that it contains a character which is not defined in your terminal's 
codepage. So let's inspect the strings more carefully:


py> a = u"µm"
py> b = u"\u03bcm"
py> a == b
False
py> ord(a[0]), ord(b[0])
(181, 956)
py> import unicodedata
py> unicodedata.name(a[0])
'MICRO SIGN'
py> unicodedata.name(b[0])
'GREEK SMALL LETTER MU'

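Does codepage 850 include Greek Small Letter Mu? The evidence suggests it 
does not: printing the Micro Sign works, so that character is in the 
codepage, while the Greek letter raises UnicodeEncodeError.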
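
To check that directly rather than infer it from the traceback, you can 
try encoding each string yourself. Here is a minimal sketch for Python 3. 
The name can_encode is just a helper I made up for illustration, and the 
characters are written as escapes so the example does not depend on the 
source file's encoding:

```python
import unicodedata


def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding.

    Hypothetical helper for illustration; not part of the standard library.
    """
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


a = "\u00b5m"  # MICRO SIGN followed by "m"
b = "\u03bcm"  # GREEK SMALL LETTER MU followed by "m"

for s in (a, b):
    # Print the name of the first character and whether cp850 can encode it.
    print(unicodedata.name(s[0]), can_encode(s, "cp850"))

# MICRO SIGN True
# GREEK SMALL LETTER MU False
```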

If you can, you should set the terminal's encoding to UTF-8. UTF-8 can 
encode every Unicode character, so that will avoid this sort of problem.
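
As a rough illustration of why UTF-8 helps (Python 3, characters again 
written as escapes), both strings encode without complaint, just to 
different bytes. If you cannot change the terminal, one possible fallback 
is to encode with errors="replace", which substitutes "?" for anything 
the codepage lacks. Whether that is acceptable depends on your output, so 
treat it as a sketch rather than a recommendation:

```python
a = "\u00b5m"  # MICRO SIGN followed by "m"
b = "\u03bcm"  # GREEK SMALL LETTER MU followed by "m"

# UTF-8 has an encoding for every code point, so neither call raises.
a_utf8 = a.encode("utf-8")
b_utf8 = b.encode("utf-8")
print(a_utf8)  # b'\xc2\xb5m'
print(b_utf8)  # b'\xce\xbcm'

# Fallback for a cp850 terminal: unencodable characters become "?".
b_fallback = b.encode("cp850", errors="replace")
print(b_fallback)  # b'?m'
```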



-- 
Steven


