[Python-Dev] Unicode proposal: %-formatting ?
M.-A. Lemburg
mal@lemburg.com
Tue, 16 Nov 1999 11:40:42 +0100
Tim Peters wrote:
>
> [MAL]
> > I wonder how we could add %-formatting to Unicode strings without
> > duplicating the PyString_Format() logic.
> >
> > First, do we need Unicode object %-formatting at all ?
>
> Sure -- in the end, all the world speaks Unicode natively and encodings
> become historical baggage. Granted I won't live that long, but I may last
> long enough to see encodings become almost purely an I/O hassle, with all
> computation done in Unicode.
>
> > Second, here is an emulation using strings and <default encoding>
> > that should give an idea of how one could work with the different
> > encodings:
> >
> > s = '%s %i abcäöü' # a Latin-1 encoded string
> > t = (u,3)
>
> What's u? A Unicode object? Another Latin-1 string? A default-encoded
> string? How does the following know the difference?
u refers to a Unicode object in the proposal. Sorry, forgot to
mention that.
> > # Convert Latin-1 s to a <default encoding> string via Unicode
> > s1 = unicode(s,'latin-1').encode()
> >
> > # The '%s' will now add u in <default encoding>
> > s2 = s1 % t
> >
> > # Finally, convert the <default encoding> encoded string to Unicode
> > u1 = unicode(s2)
>
> I don't expect this actually works: for example, change %s to %4s.
> Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to
> know that some (or all) characters in u consume multiple bytes, so can't
> extract "the right" number of bytes from u. I think % formatting has to know
> the truth of what you're doing.
Hmm, guess you're right... format parameters such as field widths
should indeed refer to characters rather than to the number of
encoded bytes.
This means a new PyUnicode_Format() implementation is needed, mapping
Unicode format objects to Unicode objects.
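To make the width problem concrete, here is a small sketch in present-day
Python 3 (3.5 or later, where bytes objects support %), not the 1999 API under
discussion. It assumes that str stands in for a Unicode object and bytes for
an encoded 8-bit string; the variable names are made up for illustration.

```python
# Illustration only: Python 3 str/bytes stand in for Unicode objects
# and encoded 8-bit strings. This is not the proposed C API.
text = "äö"                      # 2 characters
encoded = text.encode("utf-8")   # 4 bytes, 2 per character

# Byte-oriented formatting: the field width of 4 is already filled by
# the 4 encoded bytes, so no padding is added.
byte_padded = b"%4s" % encoded

# Character-oriented formatting: 2 characters are padded to width 4.
char_padded = "%4s" % text

print(len(byte_padded.decode("utf-8")))  # 2 characters, no padding
print(len(char_padded))                  # 4 characters, 2 of them spaces
```

The same format string gives different visible results depending on whether
the width counts bytes or characters, which is why formatting on the encoded
form cannot simply be reused.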
> > Note that .encode() defaults to the current setting of
> > <default encoding>.
> >
> > Provided u maps to Latin-1, an alternative would be:
> >
> > u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1')
>
> More interesting is fmt % tuple where everything is Unicode; people can muck
> with Latin-1 directly today using regular strings, so the example above
> mostly shows artificial convolution.
... hmm, there is a problem there: how should the PyUnicode_Format()
API deal with '%s' when it sees a Unicode object as argument ?
E.g. what would you get in these cases:
u = u"%s %s" % (u"abc", "abc")
Perhaps we need a new marker for "insert Unicode object here".
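One possible answer, sketched in present-day Python 3 rather than the API
under discussion: coerce 8-bit string arguments to Unicode through a default
encoding before formatting, so that no new marker is needed. The helper
unicode_format() and the DEFAULT_ENCODING value are hypothetical and only
illustrate the idea; this is not a description of how PyUnicode_Format()
works.

```python
# Hypothetical sketch: decode 8-bit string arguments with an assumed
# default encoding, then format everything as Unicode.
DEFAULT_ENCODING = "latin-1"  # assumption, stands in for <default encoding>


def unicode_format(fmt, args):
    """Format a Unicode format string, decoding any bytes arguments first."""
    coerced = tuple(
        arg.decode(DEFAULT_ENCODING) if isinstance(arg, bytes) else arg
        for arg in args
    )
    return fmt % coerced


result = unicode_format("%s %s", ("abc", b"abc"))
print(result)  # abc abc
```

The alternative raised above, a separate marker for "insert Unicode object
here", would keep '%s' byte-oriented and make the coercion explicit in the
format string instead.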
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 45 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/