[ python-Bugs-1067294 ] Incorrect length of unicode strings using .encode('utf-8')

Tue Nov 16 13:12:45 CET 2004

Bugs item #1067294, was opened at 2004-11-16 12:58
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1067294&group_id=5470

Category: Unicode
Group: Python 2.4
>Status: Closed
>Resolution: Works For Me
Priority: 5
Submitted By: Ed Schofield (edschofield)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Incorrect length of unicode strings using .encode('utf-8')

Initial Comment:

Python 2.3.4 and Python 2.4b2:

print "x = %-15s" %(x.encode('utf-8'),) + " more text"

gives an incorrect number of spaces when x is a
two-byte unicode character like à.  There is no such
problem if x is used alone rather than its encode(...)
method.

The reason seems to be this: if x = u'\u00e0' (the
character à) and s=x.encode('utf-8'), then len(s) = 2,
which breaks the print command above on a UTF-8 terminal.

A slightly longer example is attached.
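
The length mismatch described above can be reproduced with a minimal
sketch. This is not the attached example; it assumes Python 3 syntax
(the report itself uses Python 2, where x would be written u'\u00e0'),
and the variable names are illustrative only.

```python
# Assumption: Python 3.5+ (bytes %-formatting); the original report is Python 2.
x = '\u00e0'                      # one character: à
s = x.encode('utf-8')             # the same character as UTF-8 bytes

char_count = len(x)               # counts characters
byte_count = len(s)               # counts bytes

# %-15s pads to a width of 15 units of whatever type is being formatted:
padded_text = '%-15s' % x         # 15 characters
padded_bytes = b'%-15s' % s       # 15 bytes, i.e. 2 bytes for à + 13 spaces

# Decoded on a UTF-8 terminal, the byte version shows one column fewer.
shown = padded_bytes.decode('utf-8')
print(char_count, byte_count, len(padded_text), len(shown))
```

The padding is computed on bytes, so each two-byte character costs one
column of alignment once the terminal decodes the output.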

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2004-11-16 13:12

Message:

As you already noted, the problem is that you are mixing Unicode
and 8-bit strings in a way that is bound to fail: %-15s pads the
encoded string by bytes, not by characters.

You should use:

print (u"x = %-15s" %x + u" more text").encode('utf-8')

i.e., stay with Unicode as long as you can and call encode only
as the last step when doing I/O, just before passing the string
to an 8-bit stream.
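
The same advice can be sketched in Python 3 syntax, where text is
Unicode by default. This is an illustration under stated assumptions,
not part of the original comment: io.BytesIO stands in for the 8-bit
stream, and the names line and raw are hypothetical.

```python
import io

x = '\u00e0'                                 # à as a single character

# Format while everything is still text, so padding counts characters.
line = 'x = %-15s' % x + ' more text'

# Encode once, at the I/O boundary. BytesIO is a stand-in (assumption)
# for a real 8-bit stream such as a socket or a file opened in binary mode.
raw = io.BytesIO()
raw.write(line.encode('utf-8') + b'\n')

written = raw.getvalue()
print(len(line), len(written))
```

The formatted line keeps its 15-character field regardless of how many
bytes each character needs once encoded.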

----------------------------------------------------------------------
