In Python 2.x, is it possible to make unicode as default like in Python 3.x?

Wed Jun 8 16:26:22 EDT 2011

On Wed, Jun 8, 2011 at 11:22 AM, G00gle and Python Lover
<pythech.tr at gmail.com> wrote:
> Hello.
> I almost like everything in Python. Code shrinking, logic of processes,
> libraries, code design etc.
> But, we... - everybody knows that Python 2.x has lack of unicode support.
> In Python 3.x, this has been fixed :) And I like 3.x more than 2.x
> But, still major applications haven't been ported to 3.x like Django.
> Is there a way to make 2.x behave like 3.x in unicode support?
> Is it possible to use Unicode instead of Ascii or remove ascii?
> Python with ascii sucks :S
> I know:
>>
>> >>> lackOfUnicodeSupportAnnoys = u'Yeah I finally made it! Should be a
>> >>> magical thing! Unmögötich! İnanılmaz! Süper...'
>>
>> >>> print lackOfUnicodeSupportAnnoys
>>
>> Yeah I finally made it! Should be a magical thing! Unmögötich! Ýnanýlmaz!
>> Süper...
>>
>> >>> # You see the Turkish characters are not fully supported...
>>
>> >>> print str(lackOfUnicodeSupportAnnoys)
>>
>> Traceback (most recent call last):
>>
>> File "<pyshell#7>", line 1, in <module>
>>
>> print str(lackOfUnicodeSupportAnnoys)
>>
>> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
>> position 54: ordinal not in range(128)
>>
>> >>> # Encode decode really sucks...
>>
>> >>> lackOfUnicodeSupportAnnoys = 'Yeah I finally made it! Should be a
>> >>> magical thing! Unmögötich! İnanılmaz! Süper...'
>>
>> >>> # Look that I didn't use 'u'
>>
>> >>> print lackOfUnicodeSupportAnnoys
>>
>> Yeah I finally made it! Should be a magical thing! Unmögötich! İnanılmaz!
>> Süper...
>>
>> >>> # This time it worked, strange...
>>
>> >>> lackOfUnicodeSupportAnnoys = unicode('Yeah I finally made it! Should
>> >>> be a magical thing! Unmögötich! İnanılmaz! Süper...')
>>
>> Traceback (most recent call last):
>>
>> File "<pyshell#10>", line 1, in <module>
>>
>> lackOfUnicodeSupportAnnoys = unicode('Yeah I finally made it! Should be a
>> magical thing! Unmögötich! İnanılmaz! Süper...')
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 54:
>> ordinal not in range(128)
>>
>> >>> # Some annoying error again
>>
>> >>> lackOfUnicodeSupportAnnoys
>>
>> 'Yeah I finally made it! Should be a magical thing! Unm\xf6g\xf6tich!
>> \xddnan\xfdlmaz! S\xfcper...'
>>
>> >>> # And finally, most annoying thing isn't it?
>
> Thanks...

I think you're misunderstanding what Unicode support means. Python 2
does have Unicode support, but it doesn't use Unicode by default. And a
lack of Unicode by default does not mean ASCII either.

There are two ways of looking at strings: as a sequence of bytes and
as a sequence of characters. In Python 2, a sequence of bytes is
written "" and a sequence of characters is written u"". In Python
3, a sequence of bytes is written b"" and a sequence of characters
is written "".
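
To make the distinction concrete, here is a small sketch. It uses the
explicit u"" and b"" prefixes, which (as far as I know) are accepted
by Python 2.6+ and Python 3.3+ alike, so it does not depend on which
default you have. The variable names are just for illustration.

```python
# Explicit prefixes: u"" is always characters, b"" is always bytes.
text = u"S\u00fcper"      # 5 characters; \u00fc is u-umlaut
data = b"S\xc3\xbcper"    # the same word as UTF-8 bytes

text_length = len(text)   # counts characters
data_length = len(data)   # counts bytes; u-umlaut takes two bytes in UTF-8

print(text_length, data_length)
```

If what you want is for unprefixed literals to be characters in
Python 2, the closest thing to the Python 3 default is putting
"from __future__ import unicode_literals" at the top of a module
(Python 2.6 and later). It only changes literals in that module, not
how the rest of the interpreter or third-party code treats strings.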

An encoding is a mapping between characters and bytes. The only time
it matters is when you are converting from one to the other. That
conversion is needed because you can't send characters out over a
socket or write them to a file; you can only send bytes.
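
As a rough illustration, the same characters turn into different
bytes depending on the encoding you pick. The two encodings below
(UTF-8 and the Turkish Windows code page cp1254) are only examples I
chose; nothing in your message says which ones you are using.

```python
text = u"\u0130nan\u0131lmaz"   # the word from your example, written with escapes

utf8_bytes = text.encode("utf-8")      # characters -> bytes
cp1254_bytes = text.encode("cp1254")   # same characters, different bytes

# Decoding with the encoding that produced the bytes gets the text back.
round_trip = utf8_bytes.decode("utf-8")

print(len(text), len(utf8_bytes), len(cp1254_bytes))
```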

When you want to convert from bytes to characters or vice versa, you
need to specify an encoding. So instead of calling str(foo) on a
Unicode string, you should call foo.encode(charset), where charset is
the encoding that you need to use in your output. Python will try to
figure out the encoding your terminal uses if it can, but if it
can't, it will fall back to ASCII (the lowest common denominator)
rather than guess. That behavior has not changed between Python 2 and
Python 3 (except that Python 3 is more aggressive in its attempts to
figure out the console encoding).
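
Here is a minimal sketch of doing that conversion explicitly. The
helper encode_for_output and the NoEncodingStream class are
hypothetical names of mine, not anything in the standard library; the
only real API used is the stream's "encoding" attribute and
unicode.encode(). The "replace" error handler is also my choice, so
that unencodable characters become "?" instead of raising.

```python
import sys

def encode_for_output(text, stream=None, fallback="ascii"):
    """Hypothetical helper: encode text with the stream's encoding,
    or with a fallback when the stream does not report one."""
    stream = sys.stdout if stream is None else stream
    charset = getattr(stream, "encoding", None) or fallback
    return text.encode(charset, "replace")

class NoEncodingStream(object):
    """Stand-in for a pipe or file whose encoding is unknown."""
    encoding = None

class Utf8Stream(object):
    """Stand-in for a terminal that reports UTF-8."""
    encoding = "utf-8"

ascii_bytes = encode_for_output(u"S\u00fcper", NoEncodingStream())
utf8_bytes = encode_for_output(u"S\u00fcper", Utf8Stream())
```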

The reason your first example didn't work is that Python defaulted
to using one encoding to interpret the bytes when you declared the
string as Unicode (perhaps a Western European encoding) and that
encoding was different from the encoding your terminal uses. In a
Python script, you can fix that by declaring the encoding of the
source file using one of the methods specified in PEP 263 (implemented
in Python 2.3). The second example worked because there was no
conversion: you gave Python a sequence of bytes and it output that
sequence of bytes. Since your source and destination have the same
encoding, it happens to work out.
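
You can reproduce the effect from your first example by hand. This is
only a sketch: I am assuming the bytes were in cp1254 (Turkish) and
were read as cp1252 (Western European), which is a guess that happens
to match the output you posted, not something I can confirm from here.

```python
original = u"\u0130nan\u0131lmaz"      # what you typed

raw = original.encode("cp1254")         # bytes as a Turkish code page stores them
misread = raw.decode("cp1252")          # the same bytes read as Western European
correct = raw.decode("cp1254")          # the same bytes read with the right encoding

# misread now starts with Y-acute instead of dotted capital I, and has
# y-acute in place of dotless i, just like your printed output.
```

In a script, a PEP 263 declaration such as "# -*- coding: utf-8 -*-"
on the first or second line tells Python which encoding to use for
the literals in that file; it has to match how your editor actually
saved the file.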

Your last example does show something that has changed as a result of
the Unicode switch. In Python 2, the repr() of a string was
intentionally shown as ASCII with escape sequences for non-ASCII
characters to help people on terminals that didn't support the full
Unicode character set. Since the default type of string is Unicode in
Python 3, that's been switched to show the characters unless you
explicitly ask for the escaped form, for example with ascii() or by
encoding the string with "unicode_escape" (the "string-escape" codec
is Python 2 only).
The only other major thing that Python 3 added in addition to Unicode
being the default is that you can have non-ASCII variable names in
your source code.