Questions about working with character encodings

Kenneth McDonald kenneth.m.mcdonald at sbcglobal.net
Wed Dec 14 21:44:42 EST 2005


I am going to demonstrate my complete lack of understanding as to
going back and forth between character encodings, so I hope someone
out there can shed some light on this. I have always depended on the
kindness of strangers... :-)

I'm playing around with some very simplistic French-to-English
translation. As some text to work with, I copied the following from a
French news site:

     Dans les années 1960, plus d'une voiture sur deux vendues aux
     Etats-Unis était fabriquée par GM. Pendant que les ventes
     s'effondrent, les pertes se creusent : sur les neuf premiers mois
     de l'année 2005, elles s'élèvent à 3,8 milliards de dollars
     (3,18 milliards d'euros), et le dernier trimestre s'annonce
     difficile. Quant à la dette, elle est hors normes : 285 milliards
     de dollars, soit une fois et demie le chiffre d'affaires. GM est
     désormais considéré par les agences de notation financière comme
     un investissement spéculatif. Un comble pour un leader mondial !

Of course, it has lots of accented, non-ASCII characters. However, it
pasted just fine into both this email program (hopefully it displays
equally well at the other end) and my Python editing program (jEdit).

To start with, I have no idea how either the editor or the mail program
could even know which encoding to use to display this text properly...

Next, having got the text into the Python file, I presumably have to
encode it as a Unicode string, but trying something like

     text = u"""désormais considéré"""

produces a complaint to the effect of:

     UnicodeEncodeError: 'ascii' codec can't encode character u'\x8e' in position 13: ordinal not in range(128)

This occurs even when the first line of the file is

     # -*- coding: latin-1 -*-

which I'd hoped would include what I think of as the Latin characters,
including all those with graves, aigus, circonflexes, umlauts,
cédilles, and so forth. Apparently it does not :-)
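
For illustration, here is a minimal sketch of one way an error like the
one above can arise. It rests on an assumption that is not confirmed
anywhere in this message: that the editor saved the file in an encoding
other than the one named on the coding line (MacRoman is used here only
because byte 0x8E is "é" in that encoding). The helper name run_source
is made up for the example, and the code is written to run on Python 3.

```python
# Sketch only: assumes the file on disk was NOT saved as latin-1,
# even though the coding line says latin-1.
cookie = b"# -*- coding: latin-1 -*-\n"
body = u'text = u"d\u00e9sormais consid\u00e9r\u00e9"\n'


def run_source(saved_as):
    """Compile the sample source as if the editor saved it in `saved_as`."""
    source = cookie + body.encode(saved_as)
    namespace = {}
    # compile() on bytes honours the coding line when decoding the source.
    exec(compile(source, "<sample>", "exec"), namespace)
    return namespace["text"]


# Declaration matches the saved bytes: the accents survive.
matching = run_source("latin-1")

# Declaration does not match: the MacRoman byte 0x8E is read as the
# latin-1 control character U+008E instead of "é".
mismatched = run_source("mac_roman")

# Encoding the misread text as ASCII (as an ASCII-only console would
# require) then fails.
try:
    mismatched.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as err:
    ascii_error = err
```

If that assumption holds, latin-1 itself is not the problem (it does
contain é, è, ç and friends); the coding line simply has to name the
encoding the editor really writes to disk.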

So I really have two questions:

   1) How the heck did jEdit understand the text with all the accents
      I pasted into it? More specifically, how did it know the proper
      encoding to use?

   2) How do I get Python to understand this text? Is there some sort
      of coding that will work in almost every circumstance?
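
As a point of reference for question 2, here is a small sketch of one
commonly suggested pattern: keep the text as a Unicode string inside
the program and name a single encoding explicitly whenever it is
written to or read from a file. UTF-8 is used because it can represent
every Unicode character; whether it suits "almost every circumstance"
is an assumption, not something established here. The file name
sample.txt is hypothetical.

```python
import io
import os
import tempfile

# Unicode text, written with escapes so the example does not depend on
# how this file itself is saved.
text = u"GM est d\u00e9sormais consid\u00e9r\u00e9"

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Encode explicitly on the way out...
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# ...and decode with the same encoding on the way back in.
with io.open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()

# The raw bytes on disk, for comparison with the decoded text.
with io.open(path, "rb") as handle:
    raw_bytes = handle.read()
```

Here round_tripped compares equal to text, while raw_bytes holds the
two-byte UTF-8 sequence 0xC3 0xA9 for each "é".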

Many thanks for your patience with someone completely new to this
aspect of text handling,

Ken