Questions about working with character encodings

Thu Dec 15 03:03:51 EST 2005

Kenneth McDonald <kenneth.m.mcdonald at sbcglobal.net> wrote:
> I am going to demonstrate my complete lack of understanding as to  
> going back and forth between
> character encodings, so I hope someone out there can shed some light  
> on this.  I have always
> depended on the kindness of strangers... :-)
> 
> I'm playing around with some very simplistic french to english  
> translation. As some text to
> work with, I copied the following from a french news site:
> 
>     Dans les années 1960, plus d'une voiture sur deux vendues aux  
> Etats-Unis était fabriquée par GM.
>     Pendant que les ventes s'effondrent, les pertes se creusent :  
> sur les neuf premiers mois de l'année 2005,
>     elles s'élèvent à 3,8 milliards de dollars (3,18 milliards  
> d'euros), et le dernier trimestre s'annonce difficile.
>     Quant à la dette, elle est hors normes : 285 milliards de  
> dollars, soit une fois et demie le chiffre d'affaires.
>     GM est désormais considéré par les agences de notation  
> financière comme un investissement spéculatif.
>     Un comble pour un leader mondial !
> 
> Of course, it has lots of accented, non-ascii characters. However, it  
> posted just fine into both
> this email program (hopefully it displays equally well at the other  
> end), 

It has correct charset header indicating ISO-8859-1 encoding, so yes, it
displayed correctly.

> and into my Python
> editing program (jEdit).
> 
> To start with, I'm not at all cognizant of how either the editor or  
> the mail program could even
> know what encodings to use to display this text properly...

You did not tell us which OS you are using. On Unix, it all depends on
the locale: you can pass text data around transparently as long as the
characters are in the repertoire of your locale and the applications are
locale-aware (many older ones are not). It is best to use a UTF-8
locale, so that all the more or less obscure characters can be
represented.
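
To see which encoding your locale implies, something like the following
can help. It is only a sketch, assuming a Python 3 interpreter; the
helper name locale_encoding is made up for this example.

```python
import codecs
import locale


def locale_encoding():
    """Return the name of the encoding preferred by the current locale."""
    try:
        # "" means: adopt whatever the environment (LANG, LC_ALL, ...) says.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass  # unsupported locale settings; keep the default locale
    name = locale.getpreferredencoding(False)
    try:
        # Normalise aliases such as "UTF8" to Python's canonical codec name.
        return codecs.lookup(name).name
    except LookupError:
        return name


print(locale_encoding())
```

If this prints an ASCII or single-byte encoding, characters outside that
repertoire will not survive a round trip through locale-aware programs.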

On Windows, it depends on whether the programs use the old 8-bit ANSI
API or the newer Unicode API. Programs that use the Unicode API can pass
data around without problems; programs that use the 8-bit API are
restricted to the characters in your system codepage.
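
A rough way to picture the 8-bit restriction is to push a string through
a codepage and back. This is a sketch under assumptions: cp1252 stands
in for "your system codepage", through_codepage is a hypothetical helper,
and real Windows APIs may substitute characters differently from Python's
"replace" error handler.

```python
def through_codepage(text, codepage="cp1252"):
    """Simulate passing text through an 8-bit codepage.

    Characters the codepage cannot represent come back as "?".
    """
    return text.encode(codepage, errors="replace").decode(codepage)


print(through_codepage("désormais considéré"))  # all characters survive
print(through_codepage("Erdős"))                # the "ő" is lost
```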

> 
> Next, having got the text into the Python file, I presumably have to  
> encode it as a Unicode
> string, but trying something like   text = u"""désormais considéré"""  
> complains to the effect
> that :
> 
>     UnicodeEncodeError: 'ascii' codec can't encode character u'\x8e'  
> in position 13: ordinal not in range(128)
> 
> This occurs even with the first line in the file of
> 
>     # -*- coding: latin-1 -*-
> 
> which I'd hoped would include what I think of as the latin characters  
> including all those ones with
> graves, aigus, circonflexes, umlauts, cédilles, and so forth.

latin-1 is not enough for proper French (it lacks œ). It is not even
enough for English: it lacks proper typographic quotes and so on.
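
This is easy to check. A small sketch, assuming Python 3 (where every
str is Unicode); missing_from is a name invented for this example:

```python
def missing_from(text, encoding):
    """Return the characters of text that the given encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


sample = "cœur “naïve”"
print(missing_from(sample, "latin-1"))
print(missing_from(sample, "utf-8"))
```

The accented ï is fine in latin-1; the ligature and the curly quotes are
not. UTF-8 can represent all of them.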

> Apparently it does not :-)

Well, it would be enough for your example: "désormais considéré"
does indeed fit into latin-1. But Python complains about character \x8e,
which is not a printable latin-1 character (0x8E falls in the C1 control
range). Without knowing your OS and your locale (or ANSI codepage), we
cannot tell how it got there.
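
One way to narrow it down is to look at what the byte 0x8E means in a
few legacy encodings. This is only a sketch, assuming Python 3; the list
of candidate encodings is a guess, not a diagnosis of your setup, and
interpretations is a hypothetical helper.

```python
CANDIDATES = ["latin-1", "cp1252", "mac_roman", "cp437"]


def interpretations(byte_value, encodings=CANDIDATES):
    """Map each encoding name to the character it assigns to one byte."""
    raw = bytes([byte_value])
    result = {}
    for enc in encodings:
        try:
            result[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            result[enc] = None  # byte is undefined in this encoding
    return result


for enc, ch in interpretations(0x8E).items():
    print(enc, repr(ch))
```

Of these candidates, only mac_roman maps 0x8E to "é", so a legacy Mac
encoding somewhere between the clipboard and the file is one
possibility.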

> 
> So I really have two questions:
> 
>   1) How the heck did jEdit understand the text with all the accents  
> I pasted into it? More
> specifically, how did it know the proper encoding to use?

jEdit is written in Java, right? Java has good internal Unicode
support, so if your OS allowed it, pasting from a web browser worked
because the browser had to know the encoding (in order to display the
text properly).

> 
>   2) How do I get Python to understand this text? Is there some sort  
> of coding that will
> work in almost every circumstance?

utf-8, obviously. Unless you have a strong reason not to do so, use
utf-8 exclusively. You never know what strange character can appear
(even in plain English), and your working and tested application will
start crashing when it gets out into the real world.

So, use # -*- coding: utf-8 -*-, but MAKE SURE jEdit is configured to
save the file in utf-8 encoding (not knowing jEdit, I cannot tell you
how to achieve this, but jEdit's www page claims that jEdit does support
utf-8).
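
If you are unsure what the editor actually wrote to disk, you can check
whether the file's bytes are valid UTF-8. A sketch, assuming Python 3;
is_valid_utf8 and the temporary file names are made up for the
demonstration:

```python
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the file's raw bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


source = '# -*- coding: utf-8 -*-\ntext = u"désormais considéré"\n'

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.py")
    bad = os.path.join(tmp, "bad.py")
    # Same source text, saved once as UTF-8 and once as latin-1.
    with open(good, "wb") as f:
        f.write(source.encode("utf-8"))
    with open(bad, "wb") as f:
        f.write(source.encode("latin-1"))
    good_ok = is_valid_utf8(good)
    bad_ok = is_valid_utf8(bad)

print(good_ok, bad_ok)
```

The second file declares utf-8 but was saved as latin-1, which is
exactly the mismatch to avoid. Note that a pure-ASCII file passes this
check whatever the editor setting, so test with accented text.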

Then there is a little problem with python stdout trying to convert
unicode strings into system default encoding and failing if it cannot be
done, but let's leave this for the moment :-)
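
For the impatient, one defensive approach is to convert the text
yourself before printing, so that unencodable characters are escaped
instead of raising an error. A sketch, assuming Python 3; printable is a
hypothetical helper, and falling back to ascii when the stream reports
no encoding is my assumption:

```python
import sys


def printable(text, stream=None):
    """Return text with characters the stream cannot encode escaped."""
    stream = sys.stdout if stream is None else stream
    # Streams that are not real terminals may have no encoding attribute.
    enc = getattr(stream, "encoding", None) or "ascii"
    return text.encode(enc, errors="backslashreplace").decode(enc)


print(printable("cœur désormais"))
```

On a UTF-8 terminal the text comes out unchanged; on an ASCII-only one
you get escapes such as \xe9 rather than a traceback.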

-- 
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!