editing in Unicode

Thu Sep 7 06:49:10 EDT 2000

Neil Hodgson:

(Thanks for the snappy answer!)

> Bertilo Wennergren wrote:

> > What if I want to edit my Python code directly in a Unicode text
> > editor that can display all characters I want to use, and that can
> > save the code in utf-8 or utf-16? How do I write my text strings
> > so the compiler gets it right?

>    First get a hold of a Unicode capable editor.

That I already have.

> [...]
>    You can now write

> x = "@"

>    If you now look at the contents of x it should be '\320\271', the UTF-8
> representation of the mentioned character. Python strings are really
> byte-buffers - there is no encoding value associated with each string
> although they will most commonly contain ASCII strings. To convert this to
> a Unicode string use the unicode built in function:

> y = unicode(x,"UTF8")
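
For comparison, here is a minimal sketch of the same byte-string to
Unicode conversion. It assumes a present-day Python 3 interpreter rather
than the 1.6/2.0 versions discussed in this thread, so it uses
bytes.decode() in place of the unicode() built-in. The names x and y
follow the quoted example.

```python
# Python 3 sketch (an assumption; the thread's spelling was unicode(x, "UTF8")).
x = b"\320\271"           # the two UTF-8 bytes quoted above, as octal escapes
y = x.decode("utf-8")     # bytes -> Unicode string

print(len(x))             # number of bytes in the encoded form
print(len(y))             # number of characters after decoding
print(hex(ord(y)))        # code point of the decoded character
```

The byte string has length 2 while the decoded string has length 1,
which is the "byte-buffer" behaviour described above.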

Is there no way of avoiding this additional step, getting Python to always
automatically treat all strings as UTF-8 encoded Unicode strings? If I need
a lot of Unicode text strings it's a big bother to always have to explicitly
convert each and every one of them. A possible source of bugs, I'd say...

>    The second argument is the encoding that the first argument is in.
> "UTF8" is supposed to be the default for the second argument so it should be
> possible to omit it but that appears to not work in the version
> (ActivePython based on 1.6 beta) I am using.

I'll try this in a newer version.

> If doing this in real code, it's more likely you'd collapse the code
> down to:

> msg=unicode("@#&", "UTF8")

If I get this right the following simpler version ought to work:

msg=unicode("@#&")

Right? That I could live with.

What about using this:

msg = u'@#&'

?
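
As a rough illustration of the three spellings being compared, the sketch
below assumes Python 3, where the behaviour may differ from the 1.6 beta
mentioned above: bytes.decode() defaults to UTF-8 when the encoding is
omitted, and a plain or u-prefixed literal is already a Unicode string.
The variable names are made up for the example.

```python
# Python 3 sketch (an assumption; not the 1.6/2.0 behaviour from the thread).
raw = "\u0439".encode("utf-8")      # UTF-8 bytes, as a saved source file would hold

msg_explicit = raw.decode("utf-8")  # encoding named explicitly
msg_default = raw.decode()          # encoding omitted: Python 3 defaults to UTF-8
msg_literal = u"\u0439"             # Unicode literal, no conversion step needed

print(msg_explicit == msg_default == msg_literal)
```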

>    Unfortunately most Python libraries do not yet accept Unicode strings,
> even the win32* modules which should be enabled for wide strings.

:-(

>    I'm thinking of writing an editing mode for PythonWin and SciTE that
> maps \u escape sequences to/from the correct glyphs so you would be able to
> see and write
>
> msg=L"@"
>
>    which would directly create a Unicode string. This would be converted
> from \u sequences on input and back to \u sequences on output. A benefit
> of this is that the resulting files would be sensibly editable with ASCII
> only editors.

Great idea. I think my Unicode editor (UniRed) can already do this (or can
be made to do it with minimal fiddling).
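
The \u mapping such an editing mode would perform could be sketched
roughly as follows. This is only an illustration under assumptions: it
is written for Python 3, the function names to_escapes and from_escapes
are hypothetical, and it only handles characters that fit in four hex
digits. It is not how PythonWin, SciTE or UniRed actually do it.

```python
import re

# Hypothetical helper: what an editor would do when saving the file.
# Every non-ASCII character becomes a \uXXXX escape (four hex digits only).
def to_escapes(text):
    return "".join(
        ch if ord(ch) < 128 else "\\u%04x" % ord(ch)
        for ch in text
    )

# Hypothetical helper: what an editor would do when loading the file.
# Every \uXXXX escape is turned back into the character it names.
def from_escapes(text):
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

shown = 'msg = u"\u0439"'          # what the user sees and types
stored = to_escapes(shown)         # ASCII-only text written to disk
print(stored)
print(from_escapes(stored) == shown)
```

The stored form contains only ASCII characters, so it stays editable in
an ASCII-only editor, which is the benefit described above.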

--
#####################################################################
                         Bertilo Wennergren
                 <http://purl.oclc.org/net/bertilo>
                     <bertilow at hem.passagen.se>
#####################################################################