[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Sun Jan 12 18:22:21 CET 2014

On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
> On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> 
> > On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
> >
> > > AFAIK (and just for the record), there could be both Latin1 text and
> > > UTF-16 in a PDF (and other encodings too), depending on the font used:
> > [...]
> > > In Python2, txt is just a str, but in Python3 handling everything as
> > > latin1 string obviously doesn't work for TTF in this case.
> >
> > Nobody is suggesting that you use Latin-1 for *everything*. We're
> > suggesting that you use it for blobs of binary data that represent
> > arbitrary bytes. First you have to get your binary data in the first
> > place, using whatever technique is necessary.
> 
> 
> Just to check I understood what you are saying. Instead of writing:
> 
>     content = b'\n'.join([
>         b'header',
>         b'part 2 %.3f' % number,
>         binary_image_data,
>         utf16_string.encode('utf-16be'),
>         b'trailer'])

Which doesn't work, since bytes don't support %f in Python 3.

> it should now look like:
> 
>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binary_image_data.decode('latin-1'),
>         utf16_string.encode('utf-16be').decode('latin-1'),
>         'trailer']).encode('latin-1')
> 
> Correct?

Not quite as you show.

First, "utf16_string" confuses me. What is it? If it is a Unicode 
string, i.e.:

    # Python 3 semantics
    type(utf16_string)  # => returns str

then the name is horribly misleading, and it is best handled like this:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string,  # Misleading name, actually Unicode string
        'trailer'])

Note that since it's text, and content is text, there is no need to 
encode then decode.

"UTF-16" is not another name for "Unicode". Unicode is a character set.
UTF-16 is just one of a number of different encodings which map the
0x110000 possible Unicode characters (actually "code points", numbered
U+0000 through U+10FFFF) to bytes. UTF-16 is one possible way to
implement Unicode strings in memory, but not the only way. Python has
used, or does use, four distinct implementations:

1) UTF-16 in "narrow builds"
2) UTF-32 in "wide builds"
3) a hybrid approach starting in Python 3.3, where strings are
   stored as either:

   3a) Latin-1
   3b) UCS-2
   3c) UTF-32

   depending on the content of the string.

So calling an arbitrary string "utf16_string" is misleading or wrong.
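
As a rough illustration of the difference between a Unicode string and its encodings, here is a minimal sketch. The sample text and the variable names are my own, chosen only for the example; they do not come from the PDF code under discussion.

```python
import sys

# Hypothetical sample text: ASCII letters, a Latin-1 letter and the euro sign.
text = "caf\u00e9 \u20ac"

# One Unicode string, three different byte representations.
encoded = {codec: text.encode(codec)
           for codec in ("utf-8", "utf-16-be", "utf-32-be")}
for codec, data in encoded.items():
    print(codec, len(data), data)

# Code points run from U+0000 to U+10FFFF, so there are 0x110000 of them.
# (sys.maxunicode is 0x10FFFF on Python 3.3 and later.)
code_point_count = sys.maxunicode + 1
```

The string itself has six characters regardless of which encoding is later used to turn it into bytes.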

On the other hand, if it is actually a bytes object which is the product 
of UTF-16 encoding, i.e.:

    type(utf16_string)  # => returns bytes

and those bytes were generated by "some text".encode("utf-16"), then it 
is already binary data and needs to be smuggled into the text string. 
Latin-1 is good for that:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string.decode('latin-1'),
        'trailer'])
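
Why Latin-1 works for smuggling can be shown with a small sketch. The name binary_blob is a hypothetical stand-in for any binary payload, here one covering every possible byte value.

```python
# Hypothetical payload containing every possible byte value once.
binary_blob = bytes(range(256))

# Latin-1 maps byte N to code point U+00NN, so decoding never fails...
as_text = binary_blob.decode('latin-1')

# ...and encoding with the same codec gives the original bytes back.
restored = as_text.encode('latin-1')
print(restored == binary_blob)
```

Note that the round trip relies on using the same codec in both directions.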

Both examples assume that you intend to do further processing of content 
before sending it, and will encode just before sending:

    content.encode('utf-8')

(Don't use Latin-1, since it can only encode the code points U+0000
through U+00FF, not the full range of text characters.)

If that's not the case, then perhaps this is better suited to what you 
are doing:

    content = b'\n'.join([
        b'header',
        ('part 2 %.3f' % number).encode('ascii'),
        binary_image_data,  # already bytes
        utf16_string,  # already bytes
        b'trailer'])
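
For anyone wanting to try this, here is a self-contained version of the same idea. The values assigned to number, binary_image_data and utf16_string are made-up placeholders, not real PDF content.

```python
# All values below are hypothetical stand-ins for illustration.
number = 3.14159
binary_image_data = b'\x89PNG\r\n\x1a\n'      # pretend image bytes
utf16_string = 'text'.encode('utf-16be')      # already bytes

content = b'\n'.join([
    b'header',
    # Format as text first, then encode; the result is pure ASCII.
    ('part 2 %.3f' % number).encode('ascii'),
    binary_image_data,
    utf16_string,
    b'trailer'])
print(content)
```

Only the formatted number passes through str; everything else stays bytes from start to finish.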

-- 
Steven