Handling letters with accents

Fri Feb 7 11:05:17 EST 2003

Martyn Quick wrote:
> Hi All,
> 
> This is probably standard stuff, and I suspect it has something to do with
> unicode, but I don't as yet know enough to deal with it.
> 
> If I create a Tkinter entry (with attached textvariable), then someone
> using Windows can use various keys (e.g., Ctrl+Alt+u) to enter letters with
> accents into the entry.  How do I then retrieve the data entered in such a
> way that I can deal with it?  In the end, I would like to export it to
> HTML, so I will also need to be able to convert this stuff to the HTML
> standard ("&uacute;" for the above example), and again I'm not certain how
> this is done (though not knowing the answer to the previous question is
> clearly an obstacle).
> 
> To give a basic example (which illustrates the problem on a Windows98
> machine, and probably on later versions of Windows too)...
> 
> import Tkinter
> root = Tkinter.Tk()
> var = Tkinter.StringVar()
> Tkinter.Entry(root, textvariable=var).pack()
> 
> Typing Ctrl+Alt+u into the entry and then calling
> 
> print var.get()
> 
> produces a small amount of garbage that looks nothing like the u with
> acute accent that appears in the entry box.

In fact, it *is* the accented u that you typed in the entry, except that it 
appears to be encoded in UTF-8, which is Tk's base encoding. Since the console 
where your print statements go most likely uses the "standard" latin1 encoding 
by default, doing:

print unicode(var.get(), 'utf-8').encode('iso8859-1', 'replace')

instead of just "print var.get()" should do what you want.

To explain it simply, unicode(s, encodingName) decodes the plain string s: it 
reads its bytes according to the given encoding. The result is a unicode string, 
which holds the characters themselves rather than bytes in any particular 
encoding. You can then encode the unicode string in another encoding by calling 
its encode method. The first parameter is the new encoding (in the example, 
'iso8859-1' is just the official name for latin1), and the second one tells what 
to do if a character originally in the string has no equivalent in the new 
encoding. Passing 'replace' ensures that no exception will be raised; the 
missing character will just be replaced by a '?'. The result of the encode 
method is a new plain string.
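
The two steps described above can be tried without Tkinter. This is only a 
sketch: the byte strings below are hypothetical stand-ins for what var.get() 
might return, and raw.decode('utf-8') is used as the equivalent of 
unicode(raw, 'utf-8') so that the lines also run on interpreters that have no 
unicode() builtin.

```python
# Hypothetical stand-in for var.get(): the two UTF-8 bytes that encode
# U+00FA, the u with acute accent.
raw = b'\xc3\xba'

# Step 1: decode the bytes; same effect as unicode(raw, 'utf-8').
text = raw.decode('utf-8')

# Step 2: re-encode as latin1, where U+00FA is the single byte 0xFA.
latin1 = text.encode('iso8859-1', 'replace')

# A character with no latin1 equivalent (U+20AC, the euro sign, picked
# only for illustration) becomes '?' instead of raising an exception.
euro = b'\xe2\x82\xac'.decode('utf-8')
fallback = euro.encode('iso8859-1', 'replace')
```

After this, text holds a single character, latin1 holds the single byte 0xFA, 
and fallback holds just a '?'.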

However, I have no idea how to get HTML sequences from these accented 
characters, whatever the encoding. Any ideas, anyone?

HTH
-- 
- Eric Brunel <eric.brunel at pragmadev.com> -
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com