[PythonCE] Unicode default encoding

Thu Mar 2 10:35:26 CET 2006

Jeffrey Barish wrote:

>>Luke Dunstan wrote:
>>    
>>
>>>----- Original Message ----- 
>>>From: "Jeffrey Barish" <jeff_barish at earthlink.net>
>>>To: <pythonce at python.org>
>>>Sent: Friday, February 24, 2006 11:03 AM
>>>Subject: [PythonCE] Unicode default encoding
>>>  
>>>      
>>>
>>>>What is the correct way to set PythonCE's default Unicode encoding?  My
>>>>reading (Python in a Nutshell) indicates that I am supposed to make a 
>>>>change to site.py, but there doesn't seem to be a site.py in
>>>>PythonCE.  (The  closest I came is a site.pyc in python23.zip.)  Nutshell
>>>>suggests that in desperation one could put the following at the start of
>>>>the main script:   
>>>>
>>>>import sys
>>>>reload(sys)
>>>>sys.setdefaultencoding('iso-8859-15')
>>>>del sys.setdefaultencoding
>>>>
>>>>This code solved the problem I was having reading and processing text that
>>>>contains Unicode characters, but I am uncomfortable leaving a desperation
>>>>solution in place.
>>>>
>>>>        
>>>>
>>>I don't think modifying site.py would be a good solution, because if you 
>>>upgrade or reinstall python then the script will be overwritten. If you
>>>only  want to run your program on your own system then a better solution is
>>>to  create a file sitecustomize.py in your Python\Lib directory containing
>>>this: 
>>>
>>>import sys
>>>sys.setdefaultencoding('iso-8859-15')
>>>
>>>If you want to distribute your program to other people though, you can't 
>>>expect them to change their default encoding so it is better not to rely on 
>>>the default encoding at all.
>>>
>>>  
>>>      
>>>
>>Yep, using unicode and explicitly encoding/decoding is a better approach.
>>
>>Fuzzyman
>>    
>>
>
>Once again, I am forced to display my ignorance.  Sorry guys.  I really don't 
>know much about Unicode.  The solution that Luke suggested (sitecustomize.py 
>in my Python\Lib directory) works fine for me, but I am concerned about the 
>suggestion from him and Fuzzyman that explicit encoding/decoding is a better 
>approach.  What is explicit encoding/decoding?  Can someone point me to a 
>good resource for learning how to deal with Unicode correctly?
>  
>
Unicode, and text encodings in general, have a bit of a learning curve.
Once you get your head round them, Python makes it pretty straightforward.

Simple rules:

* In Python, text *really* means a unicode string
* Because ordinary strings are really just strings of bytes
* If you know the encoding, decode the byte string to turn it into unicode
* When writing or printing, encode it to turn it back into bytes
* If you don't know the encoding then you'd better pray that whatever it
is is encoded in the system default. ;-)
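
To make those rules concrete, here is a minimal sketch using a euro sign, since iso-8859-15 came up earlier in the thread. The variable names are only for illustration, and the code avoids anything specific to one Python version as far as I can manage (it should behave the same on Python 2 and on Python 3.3 or later).

```python
# A euro sign as text (a unicode object in Python 2, a str in Python 3).
text = u'\u20ac'

# Encoding turns text into bytes; the bytes depend on the encoding chosen.
latin9_bytes = text.encode('iso-8859-15')   # one byte: 0xA4
utf8_bytes = text.encode('utf_8')           # three bytes: 0xE2 0x82 0xAC

# Decoding with the encoding that produced the bytes restores the text.
assert latin9_bytes.decode('iso-8859-15') == text
assert utf8_bytes.decode('utf_8') == text

# Decoding with the wrong encoding either gives the wrong character...
wrong_text = latin9_bytes.decode('iso-8859-1')   # u'\xa4', the currency sign

# ...or fails outright.
try:
    latin9_bytes.decode('utf_8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

The same byte means different things under different encodings, which is why relying on a default is fragile.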

byte_string = open(filename).read() # read a file
# we know it is UTF8, so we decode to unicode
text = byte_string.decode('utf_8')
# ....code that uses the text
byte_string = text.encode('utf_8')   # we encode it back to UTF8
open(filename, 'w').write(byte_string) # so we can write it back out

Decoding turns a byte string into a unicode object.
Encoding turns a unicode object into a byte string.
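
As a variation on the same idea, the standard library's codecs.open can do the decoding and encoding at the file boundary for you. This is only a sketch: the scratch file path and the sample text are made up for the example, and on Python 3 the built-in open() accepts an encoding argument that does the same job.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

text = u'Price: 10 \u20ac'

# The codec wrapper encodes on write...
f = codecs.open(path, 'w', encoding='utf_8')
f.write(text)
f.close()

# ...and decodes on read, so the program only ever handles unicode text.
f = codecs.open(path, 'r', encoding='utf_8')
round_tripped = f.read()
f.close()

# Reading the same file without a codec shows the raw UTF-8 bytes on disk.
f = open(path, 'rb')
raw_bytes = f.read()
f.close()
```

Either way the encoding is named explicitly in the code, so nothing depends on the default encoding of the machine it runs on.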

If this still confuses you (which it probably does) then there are lots
of good resources. I happen to like:

    http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html

Which seems to be down at the moment. :-(

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml