Handling letters with accents
Eric Brunel
eric.brunel at pragmadev.com
Fri Feb 7 11:05:17 EST 2003
Martyn Quick wrote:
> Hi All,
>
> This is probably standard stuff, and I suspect it has something to do with
> unicode, but I don't as yet know enough to deal with it.
>
> If I create a Tkinter entry (with attached textvariable), then someone
> using Windows can use various keys (e.g., Ctrl+Alt+u) to enter letters with
> accents into the entry. How do I then retrieve the data entered in such a
> way that I can deal with it? In the end, I would like to export it to
> HTML, so I will also need to be able to convert this stuff to the HTML
> standard ("&uacute;" for the above example), and again I'm not certain how
> this is done (though not knowing the answer to the previous question is
> clearly an obstacle).
>
> To give a basic example (which illustrates the problem on a Windows98
> machine, and probably on later versions of Windows too)...
>
> import Tkinter
> root = Tkinter.Tk()
> var = Tkinter.StringVar()
> Tkinter.Entry(root, textvariable=var).pack()
>
> Typing Ctrl+Alt+u into the entry and then calling
>
> print var.get()
>
> produces a small amount of garbage that looks nothing like the u with
> acute accent that appears in the entry box.
In fact, it *is* the accented u that you typed in the entry, except that it
appears to be encoded in UTF-8, which is Tk's native encoding. Assuming the
console where your print statements go uses the "standard" latin1 encoding,
doing:
print unicode(var.get(), 'utf-8').encode('iso8859-1', 'replace')
instead of just "print var.get()" should do what you want.
To explain it simply, doing unicode(s, encodingName) decodes the plain string s,
interpreting its bytes according to the given encoding. The result is a unicode
string, which is a bit more than a plain string since it holds the characters
themselves rather than bytes tied to one particular encoding. Then, you can
encode the unicode string in any encoding by calling its encode method. The
first parameter is the target encoding (in the example, 'iso8859-1' is just the
official name for latin1), and the second one tells what to do if a character in
the string has no equivalent in the target encoding. Passing 'replace' ensures
that no exception will be raised; the missing character will just be replaced by
a '?'. The result of the encode method is a new plain string.
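As a rough sketch of the two steps above, here is the same round trip applied to
hand-written byte values rather than to the contents of a real Entry. The helper
name reencode and the sample bytes are assumptions made up for illustration, and
raw.decode(source) is used as the equivalent of unicode(raw, source):

```python
# Sketch of the decode/encode round trip described above.
# The byte values are illustrative assumptions, not captured from a real Entry.

def reencode(raw, source='utf-8', target='iso8859-1'):
    """Decode raw bytes from `source`, then encode the text in `target`.

    Characters that do not exist in `target` become '?' instead of
    raising an exception, because of the 'replace' error handler.
    """
    text = raw.decode(source)  # same effect as unicode(raw, source)
    return text.encode(target, 'replace')

utf8_u_acute = b'\xc3\xba'    # UTF-8 bytes for u with acute accent
utf8_euro = b'\xe2\x82\xac'   # UTF-8 bytes for the euro sign, absent from latin1

latin1_u_acute = reencode(utf8_u_acute)  # a single byte, 0xFA, in latin1
latin1_euro = reencode(utf8_euro)        # falls back to '?'
```

The euro sign case shows what 'replace' does: latin1 has no such character, so
the result is a plain '?' rather than an error.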
However, I have no idea how to get HTML sequences from these accented
characters, whatever the encoding. Any ideas, anyone?
HTH
--
- Eric Brunel <eric.brunel at pragmadev.com> -
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com