[Python-Dev] Unicode proposal: %-formatting ?

Tim Peters tim_one@email.msn.com
Tue, 16 Nov 1999 00:38:32 -0500


[MAL]
> I wonder how we could add %-formatting to Unicode strings without
> duplicating the PyString_Format() logic.
>
> First, do we need Unicode object %-formatting at all ?

Sure -- in the end, all the world speaks Unicode natively and encodings
become historical baggage.  Granted I won't live that long, but I may last
long enough to see encodings become almost purely an I/O hassle, with all
computation done in Unicode.

> Second, here is an emulation using strings and <default encoding>
> that should give an idea of how one could work with the different
> encodings:
>
>     s = '%s %i abcäöü' # a Latin-1 encoded string
>     t = (u,3)

What's u?  A Unicode object?  Another Latin-1 string?  A default-encoded
string?  How does the following know the difference?

>     # Convert Latin-1 s to a <default encoding> string via Unicode
>     s1 = unicode(s,'latin-1').encode()
>
>     # The '%s' will now add u in <default encoding>
>     s2 = s1 % t
>
>     # Finally, convert the <default encoding> encoded string to Unicode
>     u1 = unicode(s2)

I don't expect this actually works:  for example, change %s to %4s.
Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to
know that some (or all) characters in u consume multiple bytes, so it can't
extract "the right" number of bytes from u.  I think % formatting has to know
the truth of what you're doing.
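
A minimal sketch of the width problem, written for a present-day Python 3
interpreter rather than the 1999 API under discussion.  It assumes that
Python 3 bytes can stand in for the old 8-bit string type and that UTF-8 is
the encoding in play; the variable names are made up for illustration.

```python
# Illustration only: Python 3 bytes stand in for the 1999 8-bit string type.
text = "äö"                       # two characters
encoded = text.encode("utf-8")    # four bytes, two per character

# Character-aware formatting pads the value to a width of 4 characters.
padded_text = "%4s" % text

# Byte-oriented formatting already sees 4 bytes, so it adds no padding.
padded_bytes = b"%4s" % encoded
roundtrip = padded_bytes.decode("utf-8")

print(repr(padded_text), len(padded_text))
print(repr(roundtrip), len(roundtrip))

# Cutting at a byte count (a stand-in for what a byte-level precision would
# do) can split a character and leave bytes that no longer decode.
try:
    encoded[:1].decode("utf-8")
    truncation_failed = False
except UnicodeDecodeError:
    truncation_failed = True

print(truncation_failed)
```

The two results differ in length because the byte-level formatter counts
bytes while the text-level formatter counts characters.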

> Note that .encode() defaults to the current setting of
> <default encoding>.
>
> Provided u maps to Latin-1, an alternative would be:
>
>     u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1')

More interesting is fmt % tuple where everything is Unicode; people can muck
with Latin-1 directly today using regular strings, so the example above
mostly shows artificial convolution.
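
To make the comparison concrete, here is a rough sketch in present-day
Python 3, not the API being proposed in this thread.  The helper names
format_via_latin1 and format_unicode are hypothetical, and bytes again stand
in for the 8-bit string type.  It contrasts the Latin-1 round trip with
formatting where the format string and the arguments are all Unicode.

```python
FORMAT = "%s %i abcäöü"


def format_via_latin1(u, n):
    """Hypothetical helper: format at the byte level, then decode."""
    fmt_bytes = FORMAT.encode("latin-1")
    # Raises UnicodeEncodeError if u has characters outside Latin-1.
    return (fmt_bytes % (u.encode("latin-1"), n)).decode("latin-1")


def format_unicode(u, n):
    """Hypothetical helper: format with everything already Unicode."""
    return FORMAT % (u, n)


# Both routes agree as long as u maps to Latin-1.
print(format_via_latin1("é", 3) == format_unicode("é", 3))

# Only the all-Unicode route copes with characters outside Latin-1.
try:
    format_via_latin1("€", 3)
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(latin1_ok)
print(format_unicode("€", 3))
```

The round trip gains nothing when u maps to Latin-1 and fails outright when
it doesn't, which is why the all-Unicode case is the one worth designing for.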