Looking for an appropriate encoding standard that supports all languages

Thomas Jollans thomas at jollybox.de
Fri Aug 20 15:04:55 EDT 2010


On Thursday 19 August 2010, it occurred to ata.jaf to exclaim:
> On Aug 17, 11:55 pm, Thomas Jollans <tho... at jollybox.de> wrote:
> > On Tuesday 17 August 2010, it occurred to ata.jaf to exclaim:
> > > I am developing a little program in Mac with wxPython.
> > > But I have problems with the characters that are not in ASCII. Like
> > > some special characters in French or Turkish.
> > > So I am looking for a way to solve this. Like an encoding standard
> > > that supports all languages. Or some other way.
> > 
> > Anything that supports all of Unicode will do. Like UTF-8. If your text
> > is mostly Latin, then just go for UTF-8, if you use other alphabets
> > extensively, you might want to consider UTF-16, which might then use a
> > little less space.
> 
> OK, I used UTF-8.
> I write a line of strings in the source code and I want my program to
> show that as an output on GUI. And this line of strings includes a
> character like "ü". But I see that in GUI this character is replaced
> with other strange characters. I mean it doesn't work.
> And when I try to use UTF-16, I get a syntax error that declares
> "UTF-16 stream does not start with BOM".

I get the feeling you're not actually using the encoding you say you're using, 
or not telling every program involved what you're doing.
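
To see what such a mismatch looks like, here is a minimal sketch, assuming 
Python 3 and assuming the wrong guess happens to be Latin-1 (I don't know 
what your setup actually guesses). It stores "ü" as UTF-8 and then reads the 
bytes back under both assumptions:

```python
# The text as it should appear on screen.
original = "ü"

# What a UTF-8 editor writes to disk: two bytes for this one character.
stored = original.encode("utf-8")      # b'\xc3\xbc'

# What a program sees if it wrongly assumes the bytes are Latin-1.
misread = stored.decode("latin-1")     # "Ã¼": two strange characters

# What it sees when it is told the truth.
correct = stored.decode("utf-8")       # "ü" again

print(len(original), len(stored), len(misread))   # 1 2 2
print(correct == original)                        # True
```

One character turning into two odd-looking ones is the usual sign that UTF-8 
bytes are being interpreted as a single-byte encoding somewhere along the way.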

1. Save the file in the correct encoding. Either tell your text editor to use 
a specific encoding (UTF-8 would be a good choice), or find out what encoding 
your text editor is using and use that encoding during the rest of the 
process.

2. Tell Python which encoding you're using. The coding: line will do the 
trick, *provided* you don't lie, and the encoding you specify in the file is 
actually the encoding you're using to store the file on disk.

3. Instruct your GUI library to do the right thing. If you use unicode strings 
(either by using Python 3 or by using the u"Käse" syntax in Python 2), that 
should be enough. If you're using byte strings, which you shouldn't be doing 
in this case, you might have to tell the library which encoding you're using, 
or use the encoding it customarily expects. (For GTK+, this is UTF-8. For 
other libraries, it might be Latin-1, or system-dependent.)
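
The three steps above can be sketched roughly as follows. This assumes 
Python 3, where str is the unicode type and bytes is the byte string type. 
The to_text helper is made up for illustration and is not part of wxPython 
or any other GUI library. tokenize.detect_encoding is the standard library 
function that reads the coding: line the same way the interpreter does.

```python
import io
import tokenize


def to_text(value, encoding="utf-8"):
    """Return text (str), decoding byte strings with the stated encoding.

    Hypothetical helper, not part of any GUI toolkit.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Steps 1 and 2: a two-line source file as a UTF-8 editor would store it,
# with a coding: line that tells the truth about the bytes on disk.
source_bytes = '# -*- coding: utf-8 -*-\ngreeting = u"Käse"\n'.encode("utf-8")
declared, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
text = source_bytes.decode(declared)

# The same bytes with a coding: line that lies about the encoding.
lying_bytes = source_bytes.replace(b"utf-8", b"latin-1")
wrong, _ = tokenize.detect_encoding(io.BytesIO(lying_bytes).readline)
garbled = lying_bytes.decode(wrong)

print(declared)             # utf-8
print("Käse" in text)       # True
print("Käse" in garbled)    # False, the "ä" was read as two characters

# Step 3: hand the GUI text, not bytes. Byte strings get decoded first.
label = to_text(b"K\xc3\xa4se")    # bytes as read from a UTF-8 file
print(label == "Käse")             # True
print(len(label))                  # 4 characters, from 5 bytes
```

The point of decoding at the boundary is that everything passed to the GUI 
is already text, so the library never has to guess an encoding.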


