editing in Unicode

Neil Hodgson neilh at scintilla.org
Thu Sep 7 06:31:53 EDT 2000


Bertilo Wennergren wrote:

> What if I want to edit my Python code directly in a Unicode text
> editor that can display all characters I want to use, and that can
> save the code in utf-8 or utf-16? How do I write my text strings
> so the compiler gets it right?

   First get hold of a Unicode-capable editor. If you are on Windows, you
can use PythonWin (the new ActivePython build 100 will work), SciTE
(http://www.scintilla.org/SciTE.html) in UTF-8 mode (with the setting:
code.page=65001) or IDLE. These editors can run in UTF-8 mode but not in
UCS-2 or UTF-16. Umm, UTF-8 did work in the IDLE that came with betas of
Python 1.6, but I just downloaded 2.0b1 and it's not working any more.

   You may need to set the font to one that covers the character glyphs you
want to see. Tahoma is a good choice as it contains a large set of glyphs.

   You can now write

x = "@"

   where @ is actually a non-Roman character. There are several ways of
entering the particular character you want. You can switch the keyboard
locale to another one, such as Russian, and type the character. Press the
'q' key and you should see [cyrillic small letter short i]. Alternatively,
you may use a character map or on-screen keyboard applet.

   If you now look at the contents of x it should be '\320\271', the UTF-8
representation of that character. Python strings are really byte buffers:
there is no encoding value associated with each string, although they will
most commonly contain ASCII text. To convert this to a Unicode string use
the unicode built-in function:

y = unicode(x,"UTF8")

   The second argument is the encoding that the first argument is in. "UTF8"
is supposed to be the default for the second argument, so it should be
possible to omit it, but that appears not to work in the version
(ActivePython based on 1.6 beta) I am using. If doing this in real code, it's
more likely you'd collapse the code down to:

msg=unicode("@#&", "UTF8")
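   For readers trying this on a current interpreter, here is a minimal
sketch of the same byte/text round trip. It assumes Python 3, where str
holds Unicode text and bytes is the raw byte buffer, so bytes.decode()
plays the role of the unicode() call above. The variable names are only
illustrative.

```python
# Assumption: Python 3, not the 1.6/2.0 interpreters discussed in this post.
ch = "\u0439"  # U+0439, the character whose UTF-8 form is '\320\271'

# Encoding gives the bytes a UTF-8 editor would write to disk.
raw = ch.encode("utf-8")

# The same bytes written as octal escapes, as the old interpreter showed them.
octal = "".join("\\%03o" % b for b in raw)

# Decoding names the encoding the bytes are in, like unicode(x, "UTF8").
text = raw.decode("utf-8")

print(repr(raw))   # b'\xd0\xb9'
print(octal)       # \320\271
print(text == ch)  # True
```

   In Python 3 the encoding argument of decode() really does default to
UTF-8, so raw.decode() gives the same result.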

   Unfortunately most Python libraries do not yet accept Unicode strings,
even the win32* modules which should be enabled for wide strings.

   I'm thinking of writing an editing mode for PythonWin and SciTE that maps
\u escape sequences to/from the correct glyphs so you would be able to see
and write

msg=L"@"

   which would directly create a Unicode string. This would be converted
from \u sequences on input and back to \u sequences on output. A benefit of
this is that the resulting files would be sensibly editable with ASCII-only
editors.
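   A rough sketch of the mapping such a mode might perform is below. It is
written for Python 3 and is not code from PythonWin or SciTE; the helper
names to_escapes and from_escapes are hypothetical, and the sketch ignores
complications such as an escaped backslash in front of the u.

```python
import re

def to_escapes(text):
    """Saving: replace each non-ASCII character with a \\u or \\U escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                 # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)     # four hex digits
        else:
            out.append("\\U%08x" % cp)     # eight hex digits beyond U+FFFF
    return "".join(out)

# Matches \uXXXX or \UXXXXXXXX; does not check for a preceding backslash.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{8})")

def from_escapes(text):
    """Loading: replace each escape with the character it names."""
    def repl(match):
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))
    return _ESCAPE.sub(repl, text)

shown = 'msg = "\u0439"'          # what the editor would display
stored = to_escapes(shown)        # what would be written to the file
print(stored)                     # msg = "\u0439"
print(from_escapes(stored) == shown)  # True
```

   The stored form contains only ASCII, which is what makes the file
editable elsewhere.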

   Neil




