unicode experiments + questions

Irmen de Jong irmen at NOSPAMREMOVETHISxs4all.nl
Wed Mar 27 17:24:52 EST 2002


Hello,
I have been experimenting with Python 2.2's unicode strings and different encodings.
First, what is the best way to enter international and/or unicode characters
in Python source? I can't just type the Euro sign and trust that it
is read as a Euro sign on another platform, because how would Python know
the character encoding of the source file?
So I'm using the unicode escape syntax, but that is cumbersome
(where do I look up all my special characters?) and hard on the eyes.
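
One thing that may ease the lookup problem is the standard unicodedata module together with the \N{...} named escape, which lets you write characters by their official Unicode name rather than by hex code. The sketch below is written for a current Python 3 interpreter (an assumption; the session in this post is Python 2.2, where string literals would need the u prefix). The helper name describe is hypothetical and only there for illustration.

```python
import unicodedata

# Look up characters by their official Unicode names instead of hex codes.
euro = unicodedata.lookup('EURO SIGN')
atilde = '\N{LATIN SMALL LETTER A WITH TILDE}'

# Going the other way: report the code point and name of a character.
# "describe" is a hypothetical helper, not part of any library.
def describe(ch):
    return 'U+%04X %s' % (ord(ch), unicodedata.name(ch))

print(describe(euro))
print(describe(atilde))
```

The named form is longer to type, but it stays pure ASCII in the source file, so it does not depend on how the file itself is encoded.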

I also have the following question:
what exactly happens when I type "print u" in Python, where u
is a unicode string? For example:

>>> e=u'\u20ac'
>>> e
u'\u20ac'
>>> print e
€    (<--- this is a Euro symbol on my screen)

What charset does print convert to?
I'm on Win2000, and when I type
>>> print e.encode('cp1252')
I get the Euro symbol. Does print automatically convert to the Windows charset
cp1252?
If so, how does Python know this charset, given that my site.py has encoding="iso-8859-15"?

When I type
>>> print e.encode('iso-8859-15')
I don't see the Euro symbol, but some other weird symbol.
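
A possible explanation for the weird symbol, assuming the display interprets the printed bytes as cp1252 (which the cp1252 result above suggests, but which is still an assumption): the two charsets put the Euro sign at different byte values. The sketch below uses Python 3 syntax, where encode returns a bytes object; the variable names are illustrative only.

```python
import unicodedata

euro = '\u20ac'

as_cp1252 = euro.encode('cp1252')        # the Euro sign is byte 0x80 in cp1252
as_latin9 = euro.encode('iso-8859-15')   # the Euro sign is byte 0xA4 in iso-8859-15

# What a cp1252 display would make of the iso-8859-15 byte.
misread = as_latin9.decode('cp1252')

print(repr(as_cp1252), repr(as_latin9))
print(repr(misread), unicodedata.name(misread))
```

Under that assumption, the byte 0xA4 would be shown as the generic currency sign, which is the character iso-8859-15 replaced with the Euro sign.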


For those interested, below is the test program I'm using to generate
documents in different encodings.
Interestingly enough, UTF-7 is not understood by Opera 6. IE 5 and Mozilla get
it right.

import codecs

euro=u'\u20ac'
atilde=u'\u00e3'

print 'The unicode euro symbol = ',repr(euro),'  -->  ',euro
print 'The unicode a-tilde = ',repr(atilde),'  -->  ',atilde

def getHTMLDoc(encoding):
 doc=u'<HTML><HEAD>\n' \
     u'<META http-equiv=Content-Type content="text/html; charset='+encoding+u'">\n' \
     u'<TITLE>Euro sign etc</TITLE></HEAD>\n' \
     u'<BODY><P>EURO SIGN: ' + euro + u'\n'\
     u'<P>A TILDE: ' + atilde + u'\n</BODY></HTML>'
 return doc
def getXMLDoc(encoding):
 doc=u'<?xml version="1.0" encoding="'+encoding+u'"?>\n' \
     u'<root><euro>'+euro+u'</euro><atilde>'+atilde+u'</atilde></root>'
 return doc

def makeEncoding(encoding):
 doc=getHTMLDoc(encoding)
 codecs.open('euro-'+encoding+'.html','wb',encoding).write(doc)
 doc=getXMLDoc(encoding)
 codecs.open('euro-'+encoding+'.xml','wb',encoding).write(doc)


for enc in ['utf-16','utf-8','utf-7','iso-8859-15','cp1252']:
 makeEncoding(enc)
 print 'Euro symbol in '+enc+' is: '+repr(euro.encode(enc))+'  -->  '+euro.encode(enc)
 print 'A-tilde in '+enc+' is: '+repr(atilde.encode(enc))+'  -->  '+atilde.encode(enc)
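
To separate codec problems from browser problems, one option is to check that each encoding in the list round-trips both characters in memory before looking at the generated files. This is a minimal sketch in Python 3 syntax (an assumption; the program above is Python 2.2), and the names sample, can_encode and results are hypothetical. It only tests the codecs themselves and says nothing about how a given browser handles the output.

```python
sample = '\u20ac \u00e3'   # Euro sign, space, a-tilde

# Hypothetical helper: True if the codec can represent the text at all.
def can_encode(text, encoding):
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Encode and decode again; a lossless codec gives back the original string.
results = {}
for enc in ['utf-16', 'utf-8', 'utf-7', 'iso-8859-15', 'cp1252']:
    results[enc] = sample.encode(enc).decode(enc)
    print(enc, results[enc] == sample)

# For contrast: plain iso-8859-1 has no Euro sign.
print('iso-8859-1', can_encode(sample, 'iso-8859-1'))
```

If all five encodings round-trip cleanly, a document that displays wrongly is more likely a matter of how the viewer decodes it than of how it was written.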





