unicode by default

Wed May 11 18:34:02 EDT 2011

On Wed, May 11, 2011 at 2:37 PM, harrismh777 <harrismh777 at charter.net> wrote:
> hi folks,
>   I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)
>
>   I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
>
>   On my installation 2.6  sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?).   Do I understand this much correctly?
>

Not really sure about that, but it doesn't matter anyway. Because even
though internally the string is stored as either a UCS-2 or a UCS-4
string, you never see that. You just see this string as a sequence of
characters. If you want to turn it into a sequence of bytes, you have
to use an encoding.

>   The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4.  If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?
>
>   If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

ASCII only goes up to 0x7F. If you were using UTF-8 bytestrings, then
there is a difference for anything over that range. A byte string is a
sequence of bytes. A unicode string is a sequence of these mythical
abstractions called characters. So a unicode string u'\u00a0' will
have a length of 1. Encode that to UTF-8 and you'll find it has a
length of 2 (because UTF-8 uses 2 bytes to encode everything over 128-
the top bit is used to signal that you need the next byte for this
character)

 If you want the history behind the whole encoding mess, Joel Spolsky
wrote a rather amusing article explaining how this all came about:
http://www.joelonsoftware.com/articles/Unicode.html

And the biggest reason to use Unicode is so that you don't have to
worry about your program messing up because someone hands you input in
a different encoding than you used.