unicode by default

Benjamin Kaplan benjamin.kaplan at case.edu
Thu May 12 00:14:49 EDT 2011


On Wed, May 11, 2011 at 8:44 PM, harrismh777 <harrismh777 at charter.net> wrote:
> Steven D'Aprano wrote:
>>>
>>> You need to understand the difference between characters and bytes.
>>
>> http://www.joelonsoftware.com/articles/Unicode.html
>>
>> is also a good resource.
>
> Thanks for being patient guys, here's what I've done:
>
> >>> astr="pound sign"
> >>> asym=" \u00A3"
> >>> afile=open("myfile", mode='w')
> >>> afile.write(astr + asym)
> 12
> >>> afile.close()
>
>
> When I edit "myfile" with vi I see the 'characters' :
>
> pound sign £
>
>   ... same with emacs, same with gedit  ...
>
>
> When I hexdump myfile I see this:
>
> 0000000 6f70 6e75 2064 6973 6e67 c220 00a3
>
>
> This is *not* what I expected... well it is (little-endian) right up to the
> 'c2' and that is what is confusing me....
>
> I did not open the file with an encoding of UTF-8... so I'm assuming UTF-16
> by default (python3) so I was expecting a '00A3' little-endian as 'A300' but
> what I got instead was UTF-8 little-endian  'c2a3' ....
>
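Quick note here: UTF-8 doesn't have an endianness. It's a sequence of
single bytes, always read in order, and the high bits of each byte
tell you whether it starts a character or continues one. The bytes in
your file are 20 c2 a3. They only look swapped in your dump because
hexdump groups the input into 16-bit words by default and prints each
word in your machine's (little-endian) byte order. Try hexdump -C to
see the bytes in file order.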
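
To make the byte-order point concrete, here is a small sketch (the
variable names are mine, purely for illustration) that prints the
bytes each encoding produces for the pound sign. Only the UTF-16
variants depend on byte order; UTF-8 gives the same two bytes
everywhere.

```python
# U+00A3 POUND SIGN, written with the same escape used in the quoted session
pound = "\u00A3"

utf8 = pound.encode("utf-8")
utf16_le = pound.encode("utf-16-le")
utf16_be = pound.encode("utf-16-be")

print(utf8.hex())      # c2a3
print(utf16_le.hex())  # a300
print(utf16_be.hex())  # 00a3
```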

> See my problem?... when I open the file with emacs I see the character pound
> sign... same with gedit... they're all using UTF-8 by default. By default it
> looks like Python3 is writing output with UTF-8 as default... and I thought
> that by default Python3 was using either UTF-16 or UTF-32. So, I'm confused
> here...  also, I used the character sequence \u00A3 which I thought was
> UTF-16... but Python3 changed my intent to  'c2a3' which is the normal
> UTF-8...
>

The fact that CPython uses UCS-2 or UCS-4 internally is an
implementation detail and isn't actually part of the Python
specification. As far as a Python program is concerned, a Unicode
string is a sequence of characters (code points), not bytes. The
\u00A3 escape just names the code point U+00A3; it says nothing about
UTF-16 or any other encoding. Much like any other object, a Unicode
character needs to be serialized before it can be written to a file.
An encoding is a serialization function for characters.
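
A minimal sketch of that distinction, reusing the string from your
session (the names text and data are just for illustration): the
string has 12 characters, but its UTF-8 serialization has 13 bytes.

```python
text = "pound sign" + " \u00A3"   # the string from the quoted session

data = text.encode("utf-8")       # serialize characters to bytes

print(len(text))   # 12 characters, the value afile.write() returned
print(len(data))   # 13 bytes, because the pound sign takes two bytes in UTF-8
print(data.decode("utf-8") == text)  # True: decoding reverses the encoding
```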

If you don't specify an encoding when you open the file, Python will
default to locale.getpreferredencoding(), which gets your system's
preferred encoding from the locale settings, typically taken from
environment variables (in other words, the same source that emacs and
gedit will use to get the default encoding).
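
If you want to check this on your own machine, something like the
following should do it. It is only a sketch: the temporary directory
and the choice of "utf-16-le" are my assumptions for the example, and
as far as I know locale.getpreferredencoding(False) is what open()
consults in text mode. Passing encoding= explicitly takes the locale
out of the picture.

```python
import locale
import os
import tempfile

# The encoding open() falls back to in text mode when none is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Passing encoding= explicitly makes the output independent of the locale.
path = os.path.join(tempfile.mkdtemp(), "myfile")
with open(path, mode="w", encoding="utf-16-le") as afile:
    afile.write("pound sign \u00A3")

# Read the file back in binary mode to inspect the raw bytes.
with open(path, "rb") as raw:
    raw_bytes = raw.read()
print(raw_bytes[-2:].hex())  # a300
```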


