unicode issue

Dave Angel davea at ieee.org
Thu Oct 1 06:50:21 EDT 2009


gentlestone wrote:
>> save in utf-8 the coding declaration also has to be utf-8
>>     
>
> ok, I understand, but what's the problem? Unfortunately it seems that
> the Python interactive mode doesn't have unicode support. It recognizes
> the latin-1 encoding only.
>
> So I have 2 options, how to write doctest:
> 1. Replace native characters with their encoded representation like
> u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" instead of u"Žabovitá
> zmiešaná kaša"
> 2. Use latin-1 encoding, where the file is saved in utf-8
>
> The first is bad because doctest is a great documentation tool and it
> is probably the main reason I use python. And something like
> u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" is not the best
> documentation style. But the tests work.
>
> The second is bad, because the declaration is incorrect and if I use
> it in a Django model declaration, for example, I get bad data in the
> application.
>
> So what is the solution? Back to Java? :-)
>
>   
Wait -- don't give up yet.  Since I'm one of the ones who (partially) 
steered you wrong, let me try to help.

The key variable here is how your text editor behaves.  Since I've never 
taken my (programming) text editor out of ASCII mode before this week, 
it took some experimenting (and more importantly a message from Piet on 
this thread) to make sense of things.  I think I now know how to make my 
own editor (Komodo IDE) behave in this environment, and you probably can 
do as well or better.  In fact, judging from your messages, you probably 
are doing much better on the editor front.

When I tried this morning to re-open that test file from yesterday, many 
of the characters were all messed up.  I was okay as long as the project 
was still open, but not today.  The editor itself apparently looks to 
that encoding declaration when it's deciding how to interpret the bytes 
on disk.

So I did the following, using Komodo IDE.  I created a new file in the 
project.  Before saving it, I used 
Edit->CurrentFileSettings->Properties->Encoding to set it to UTF-8.  
*NOW* I pasted the stuff from your email message, and added

    # -*- coding: utf-8 -*-

as the second line of the file.   Notice it's *NOT* latin-1.

At this point I save and run the file, and it seems to work fine.
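
One way to check that the declaration and the bytes on disk really agree
is to compare the literal against its escaped spelling.  This is only a
minimal sketch, assuming the file is saved as UTF-8; the names literal,
escaped, declaration_matches and mojibake are mine, for illustration.

```python
# -*- coding: utf-8 -*-
# Assumes this file is saved as UTF-8, matching the declaration above.

literal = u"Žabovitá zmiešaná kaša"
escaped = u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a"

# True only when the source bytes were decoded with the right codec.
declaration_matches = (literal == escaped)

# What a latin-1 reading of the same UTF-8 bytes looks like: each
# non-ASCII character turns into two wrong ones.
mojibake = literal.encode("utf-8").decode("latin-1")
```

If declaration_matches comes out False, the editor and the coding line
disagree about the file, which is the situation that produces bad data
later on.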

My guess is that I could set these as default settings in Komodo, if I 
were doing UTF-8 very often, and it would become painless.  I know I 
have certain stuff in my python template, and could add that encoding 
line as well.


Anyway, that gets us to the step of running the doctest.  The trick here 
seems to be that we need to define the docstring as a Unicode docstring 
to have it interpreted correctly.  Try adding the u in front of the 
triple quote as follows:

def downcode(name):
    u"""
    >>> downcode(u"Žabovitá zmiešaná kaša")
    u'Zabovita zmiesana kasa'
    """
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return name

Now, if the doctest passes, we seem to be in good shape.
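
For anyone who wants to try this without the rest of the module, here is
a self-contained sketch.  The _MAP below is hypothetical (the real one
isn't shown in this thread) and covers only the characters in the example.
I've also written the doctest as a comparison so it reads the same on
interpreters that print the u prefix and those that don't; that is my
choice, not something doctest requires.

```python
# -*- coding: utf-8 -*-
import doctest

# Hypothetical mapping: the real _MAP is not shown in this thread.
_MAP = {
    u"Ž": u"Z",
    u"á": u"a",
    u"š": u"s",
}


def downcode(name):
    u"""
    >>> downcode(u"Žabovitá zmiešaná kaša") == u"Zabovita zmiesana kasa"
    True
    """
    # items() instead of iteritems() so the sketch also runs on Python 3.
    for key, value in _MAP.items():
        name = name.replace(key, value)
    return name


if __name__ == "__main__":
    doctest.testmod()
```

The comparison form is less pleasant as documentation than showing the
result directly, so treat it as a portability trade-off.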

There's another problem that hopefully somebody else can help with:  
what happens when doctest needs to report an error.  When I deliberately 
changed the expected string, I got an error like the following.

UnicodeEncodeError: 'ascii' codec can't encode character u'\u017d' in position 150: ordinal not in range(128)

I get a similar error if running the -v option on doctest.   (Note that 
I do *NOT* get the error when running inside Komodo.  And what I've read 
implies that the same would be true if running inside IDLE.)  The 
problem is similar to the one you'd have doing a simple:

    print u"\u017d"

I think these are avoided if  sys.stdout.encoding (and maybe 
sys.stderr.encoding) are set to utf-8.  On my system they're set to 
None, which says to use "the system default encoding."  On my system 
that would be ASCII, so I get the error.  But perhaps yours is already 
something better.
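
To see what your own system reports, something like the following should
do.  It's a rough sketch: describe_streams is a helper name I made up,
and getattr is there only in case stdout has been replaced by an object
with no encoding attribute.

```python
import sys


def describe_streams():
    """Collect the encodings Python reports for its output streams."""
    return {
        # May be None, for example when output is piped on Python 2.
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
        "default": sys.getdefaultencoding(),
    }


for name, encoding in sorted(describe_streams().items()):
    sys.stdout.write("%s = %s\n" % (name, encoding))

# The failure itself can be reproduced without touching the console:
try:
    u"\u017d".encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)
else:
    error_message = None
```

If stdout shows None or an ASCII-only codec, printing the test string is
likely to fail the same way it does for me.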

I found these links:
    http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/
    http://wiki.python.org/moin/PrintFails
    http://lists.macromates.com/textmate/2008-June/025735.html
which indicate you may want to try:

set LC_CTYPE=en_GB.utf-8

at the command prompt before running python.  (On a Unix shell the 
one-line form would be  LC_CTYPE=en_GB.utf-8 python.)  This could be 
system specific;  it didn't work for me on XP.

The workaround that works for me (so far) is:

if __name__ == "__main__":
    import sys, codecs
    sys.stdout = codecs.getwriter('utf8')(sys.stdout)

    print u"Žabovitá zmiešaná kaša"
    import doctest
    doctest.testmod()

The codecs line tells python that stdout should use utf-8.  That doesn't 
make the characters look good on my console, but at least it avoids the 
errors.  I'm guessing that on my system I should use latin1 here instead 
of utf8.  But I don't want to confuse things.
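
To see what the wrapped stream actually emits, here is a small sketch
that points the same kind of writer at an in-memory buffer instead of
the real stdout.  The BytesIO stand-in and the variable names are my
own.  It also tries latin1 with errors="replace", since latin1 has no
Ž or š and a strict latin1 writer would raise on this text.

```python
# -*- coding: utf-8 -*-
import codecs
import io

text = u"Žabovitá zmiešaná kaša"

# Stand-in for a byte-oriented stdout so the result can be inspected.
raw_utf8 = io.BytesIO()
codecs.getwriter("utf8")(raw_utf8).write(text)
utf8_bytes = raw_utf8.getvalue()

# latin1 cannot represent U+017D or U+0161; errors="replace" writes
# "?" for those characters instead of raising UnicodeEncodeError.
raw_latin1 = io.BytesIO()
codecs.getwriter("latin1")(raw_latin1, errors="replace").write(text)
latin1_bytes = raw_latin1.getvalue()
```

So switching to latin1 would probably need an errors argument as well,
and would still lose the characters latin1 doesn't have.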


HTH

DaveA



