[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Sat Jan 11 06:36:42 CET 2014

On Fri, Jan 10, 2014 at 06:17:02PM +0100, Juraj Sukop wrote:

> As you may know, PDF operates over bytes and an integer or floating-point
> number is written down as-is, for example "100" or "1.23".

I'm sorry, I don't understand what you mean here. I'm honestly not 
trying to be difficult, and you sound confident that you understand what 
you are doing, but your description doesn't make sense to me. To me, it 
looks like you are conflating bytes and ASCII characters, that is, 
assuming that characters "are" in some sense identical to their ASCII 
representation. Let me explain:

The integer that in English is written as 100 is represented in memory 
as bytes 0x0064 (assuming a big-endian C short), so when you say "an 
integer is written down AS-IS" (emphasis added), to me that says that 
the PDF file includes the bytes 0x0064. But then you go on to write the 
three character string "100", which (assuming ASCII) is the bytes 
0x313030. Going from the C short to the ASCII representation 0x313030 is 
nothing like inserting the int "as-is". To put it another way, the 
Python 2 '%d' format code does not just copy bytes.

I think that what you are trying to say is that a PDF file is a binary 
file which includes some ASCII-formatted text fields. So when writing an 
integer 100, rather than writing it "as is" which would be byte 0x64 
(with however many leading null bytes needed for padding), it is 
converted to ASCII representation 0x313030 first, and that's what needs 
to be inserted.
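
To make the distinction concrete, here is a minimal sketch comparing the 
two representations. The variable names are only illustrative, and the 
big-endian C short is the same assumption as above:

```python
import struct

n = 100

# The integer "as is": a two-byte big-endian C short.
raw = struct.pack(">h", n)

# The integer as ASCII text, which is what a '%d'-style format produces.
text = ("%d" % n).encode("ascii")

print(raw.hex())   # 0064
print(text.hex())  # 313030
```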

If you consider PDF as binary with occasional pieces of ASCII text, then 
working with bytes makes sense. But I wonder whether it might be better 
to consider PDF as mostly text with some binary bytes. Even though the 
bulk of the PDF will be binary, the interesting bits are text. E.g. your 
example:

> In the case of PDF, the embedding of an image into PDF looks like:
> 
>     10 0 obj
>       << /Type /XObject
>          /Width 100
>          /Height 100
>          /Alternates 15 0 R
>          /Length 2167
>       >>
>     stream
>     ...binary image data...
>     endstream
>     endobj

Even though the binary image data is probably much, much larger in 
length than the text shown above, it's (probably) trivial to deal with: 
convert your image data into bytes, decode those bytes as Latin-1, 
then concatenate the resulting string into the text above.
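
As a rough sketch of that idea (not a complete or valid PDF): the payload 
below is a stand-in for real image data, the header is a simplified 
version of the quoted example, and the names image_bytes, header and 
pdf_obj are made up for illustration.

```python
# Stand-in for real image data: every possible byte value once.
image_bytes = bytes(range(256))

# Hypothetical object header, modelled on the example quoted above.
header = (
    "10 0 obj\n"
    "  << /Type /XObject\n"
    "     /Width %d\n"
    "     /Height %d\n"
    "     /Length %d\n"
    "  >>\n"
    "stream\n"
) % (100, 100, len(image_bytes))

# Decode the binary payload as Latin-1 and splice it into the text.
pdf_obj = header + image_bytes.decode("latin-1") + "\nendstream\nendobj\n"

# Encoding back to Latin-1 recovers the payload byte for byte.
encoded = pdf_obj.encode("latin-1")
assert image_bytes in encoded
assert len(encoded) == len(pdf_obj)
```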

Latin-1 has the nice property that every byte decodes into the character 
with the same code point, and vice versa. So:

for i in range(256):
    assert bytes([i]).decode('latin-1') == chr(i)
    assert chr(i).encode('latin-1') == bytes([i])

passes. It seems to me that your problem goes away if you use Unicode 
text with embedded binary data, rather than binary data with embedded 
ASCII text. Then when writing the file to disk, of course you encode it 
to Latin-1, either explicitly:

pdf = ... # Unicode string containing the PDF contents
with open("outfile.pdf", "wb") as f:
    f.write(pdf.encode("latin-1"))

or implicitly:

with open("outfile.pdf", "w", encoding="latin-1") as f:
    f.write(pdf)
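
A small self-check of the two approaches, as a sketch only. The file names 
and the stand-in pdf string are hypothetical. The newline="" argument is an 
addition not in the snippet above: without it, text mode may translate 
"\n" to the platform line ending on write, which would alter any payload 
byte equal to 0x0A.

```python
import os
import tempfile

# Stand-in for PDF text with an embedded binary payload (includes 0x0A, 0x0D).
pdf = "stream\n" + bytes(range(256)).decode("latin-1") + "\nendstream\n"
expected = pdf.encode("latin-1")

with tempfile.TemporaryDirectory() as tmp:
    explicit_path = os.path.join(tmp, "explicit.pdf")
    implicit_path = os.path.join(tmp, "implicit.pdf")

    # Explicit: encode first, write in binary mode.
    with open(explicit_path, "wb") as f:
        f.write(expected)

    # Implicit: text mode. newline="" is an assumption added here so that
    # "\n" is written unchanged rather than as the platform line ending.
    with open(implicit_path, "w", encoding="latin-1", newline="") as f:
        f.write(pdf)

    with open(explicit_path, "rb") as f:
        explicit_bytes = f.read()
    with open(implicit_path, "rb") as f:
        implicit_bytes = f.read()

# Both routes should produce identical bytes on disk.
assert explicit_bytes == implicit_bytes == expected
```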

There may be a few wrinkles I haven't thought of; I don't claim to be an 
expert on PDF. But I see no reason why PDF files ought to be an 
exception to the rule:

    * work internally with Unicode text;

    * convert to and from bytes only on input and output.

Please also take note that in Python 3.3 and later, the internal 
representation of Unicode strings containing only code points up to 255 
(i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte 
per character.
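
One way to observe this, assuming CPython 3.3 or later (sys.getsizeof 
reports implementation-specific sizes, so other interpreters may differ). 
The helper bytes_per_char is a made-up name; it subtracts the sizes of two 
strings that differ by one character so the fixed object overhead cancels.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two string lengths."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

print(bytes_per_char("a"))       # ASCII
print(bytes_per_char("\xff"))    # Latin-1, code point 255
print(bytes_per_char("\u0100"))  # first code point above 255
```

On CPython 3.3+ the first two should print 1 and the third 2.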

Another advantage is that using text rather than bytes means that your 
example:

[...]
> dropping the bytes-formatting of numbers makes it more complicated
> than it was. I would appreciate any explanation on how:
> 
>     b'%.1f %.1f %.1f RG' % (r, g, b)

becomes simply

    '%.1f %.1f %.1f RG' % (r, g, b)

in Python 3. In Python 3.3 and above, it can also be written as:

    u'%.1f %.1f %.1f RG' % (r, g, b)

which conveniently is exactly the same syntax you would use in Python 2. 
That's *much* nicer than your suggestion:

> is more confusing than:
> 
>     b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), 
>      (r, g, b)))
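
For comparison, a minimal sketch of the text-first approach: format as a 
Unicode string, then encode once at the point of output. The colour values 
here are made up for illustration.

```python
r, g, b = 1.0, 0.5, 0.0  # hypothetical colour components

# Format as text, exactly as in the one-liner above.
line = "%.1f %.1f %.1f RG" % (r, g, b)

# Convert to bytes only when the data is about to be written out.
data = line.encode("ascii")

print(data)  # b'1.0 0.5 0.0 RG'
```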

-- 
Steven