[Python-Dev] Unicode proposal: %-formatting ?

M.-A. Lemburg mal@lemburg.com
Tue, 16 Nov 1999 11:40:42 +0100


Tim Peters wrote:
> 
> [MAL]
> > I wonder how we could add %-formatting to Unicode strings without
> > duplicating the PyString_Format() logic.
> >
> > First, do we need Unicode object %-formatting at all ?
> 
> Sure -- in the end, all the world speaks Unicode natively and encodings
> become historical baggage.  Granted I won't live that long, but I may last
> long enough to see encodings become almost purely an I/O hassle, with all
> computation done in Unicode.
> 
> > Second, here is an emulation using strings and <default encoding>
> > that should give an idea of how one could work with the different
> > encodings:
> >
> >     s = '%s %i abcäöü' # a Latin-1 encoded string
> >     t = (u,3)
> 
> What's u?  A Unicode object?  Another Latin-1 string?  A default-encoded
> string?  How does the following know the difference?

u refers to a Unicode object in the proposal. Sorry, forgot to
mention that.
 
> >     # Convert Latin-1 s to a <default encoding> string via Unicode
> >     s1 = unicode(s,'latin-1').encode()
> >
> >     # The '%s' will now add u in <default encoding>
> >     s2 = s1 % t
> >
> >     # Finally, convert the <default encoding> encoded string to Unicode
> >     u1 = unicode(s2)
> 
> I don't expect this actually works:  for example, change %s to %4s.
> Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to
> know that some (or all) characters in u consume multiple bytes, so can't
> extract "the right" number of bytes from u.  I think % formatting has to know
> the truth of what you're doing.

Hmm, guess you're right... format parameters should indeed refer
to characters rather than to the number of encoded bytes.

This means a new PyUnicode_Format() implementation mapping
Unicode format objects to Unicode objects.
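
To illustrate why widths have to count characters, here is a minimal
sketch. It assumes a present-day Python 3 interpreter, where str is
Unicode and bytes stands in for the 8-bit strings discussed here; it
is not the proposed PyUnicode_Format() API, and the variable names are
made up for the example.

```python
# Sketch: a "%4s" field width counted in characters vs. in bytes.
text = "äöü"                     # three characters
encoded = text.encode("utf-8")   # the same text as six UTF-8 bytes

# Character-aware formatting: 3 characters are padded to a width of 4.
padded_chars = "%4s" % text

# Byte-oriented formatting (bytes % needs Python 3.5+): the value is
# already 6 bytes long, so the width of 4 adds no padding at all.
padded_bytes = b"%4s" % encoded

print(repr(padded_chars))                   # one leading space added
print(repr(padded_bytes.decode("utf-8")))   # no padding added
```

The byte-oriented result silently loses the padding, which is the
kind of mismatch Tim describes for %4s.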
 
> > Note that .encode() defaults to the current setting of
> > <default encoding>.
> >
> > Provided u maps to Latin-1, an alternative would be:
> >
> >     u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1')
> 
> More interesting is fmt % tuple where everything is Unicode; people can muck
> with Latin-1 directly today using regular strings, so the example above
> mostly shows artificial convolution.

... hmm, there is a problem there: how should the PyUnicode_Format()
API deal with '%s' when it sees a Unicode object as argument ?

E.g. what would you get in these cases:

u = u"%s %s" % (u"abc", "abc")

Perhaps we need a new marker for "insert Unicode object here".
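
One alternative to a new marker would be to coerce every 8-bit string
argument to Unicode before substitution. The following is only a
sketch of that idea for a present-day Python 3 interpreter:
format_unicode() and its encoding parameter are hypothetical names
invented for this example, and Latin-1 is an assumed stand-in for
<default encoding>.

```python
def format_unicode(fmt, args, encoding="latin-1"):
    """Hypothetical helper: decode byte-string arguments, then format.

    `encoding` stands in for <default encoding>; Latin-1 is only an
    assumption made for this sketch.
    """
    coerced = tuple(
        arg.decode(encoding) if isinstance(arg, bytes) else arg
        for arg in args
    )
    # fmt is a Unicode format string, so the result is Unicode as well.
    return fmt % coerced


result = format_unicode("%s %s", ("abc", b"abc"))
print(repr(result))
```

With this approach '%s' keeps a single meaning, at the price of an
implicit decoding step that fails for bytes that are not valid in the
assumed encoding.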

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/