[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Steven D'Aprano
steve at pearwood.info
Sat Jan 11 06:36:42 CET 2014
On Fri, Jan 10, 2014 at 06:17:02PM +0100, Juraj Sukop wrote:
> As you may know, PDF operates over bytes and an integer or floating-point
> number is written down as-is, for example "100" or "1.23".
I'm sorry, I don't understand what you mean here. I'm honestly not
trying to be difficult: you sound confident that you understand what
you are doing, but your description doesn't make sense to me. To me, it
looks like you are conflating bytes and ASCII characters, that is,
assuming that characters "are" in some sense identical to their ASCII
representation. Let me explain:
The integer that in English is written as 100 is represented in memory
as bytes 0x0064 (assuming a big-endian C short), so when you say "an
integer is written down AS-IS" (emphasis added), to me that says that
the PDF file includes the bytes 0x0064. But then you go on to write the
three-character string "100", which (assuming ASCII) is the bytes
0x313030. Going from the C short to the ASCII representation 0x313030 is
nothing like inserting the int "as-is". To put it another way, the
Python 2 '%d' format code does not just copy bytes.
I think that what you are trying to say is that a PDF file is a binary
file which includes some ASCII-formatted text fields. So when writing an
integer 100, rather than writing it "as is" which would be byte 0x64
(with however many leading null bytes needed for padding), it is
converted to ASCII representation 0x313030 first, and that's what needs
to be inserted.
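To make the distinction concrete, here is a minimal sketch. It assumes a
big-endian 16-bit short, as above; the variable names are only for
illustration.

```python
import struct

n = 100

# "As-is": the raw representation of the integer as a big-endian C short.
raw = struct.pack(">h", n)
assert raw == b"\x00\x64"

# What the PDF actually needs: the ASCII text of the decimal digits.
text = str(n).encode("ascii")
assert text == b"100"
assert text == bytes([0x31, 0x30, 0x30])
```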
If you consider PDF as binary with occasional pieces of ASCII text, then
working with bytes makes sense. But I wonder whether it might be better
to consider PDF as mostly text with some binary bytes. Even though the
bulk of the PDF will be binary, the interesting bits are text. E.g. your
example:
> In the case of PDF, the embedding of an image into PDF looks like:
>
> 10 0 obj
> << /Type /XObject
> /Width 100
> /Height 100
> /Alternates 15 0 R
> /Length 2167
> >>
> stream
> ...binary image data...
> endstream
> endobj
Even though the binary image data is probably much, much larger in
length than the text shown above, it's (probably) trivial to deal with:
convert your image data into bytes, decode those bytes as Latin-1,
then concatenate the Latin-1 string into the text above.
Latin-1 has the nice property that every byte decodes into the character
with the same code point, and vice versa. So:
    for i in range(256):
        assert bytes([i]).decode('latin-1') == chr(i)
        assert chr(i).encode('latin-1') == bytes([i])
passes. It seems to me that your problem goes away if you use Unicode
text with embedded binary data, rather than binary data with embedded
ASCII text. Then when writing the file to disk, of course you encode it
to Latin-1, either explicitly:
    pdf = ...  # Unicode string containing the PDF contents
    with open("outfile.pdf", "wb") as f:
        f.write(pdf.encode("latin-1"))
or implicitly:
    with open("outfile.pdf", "w", encoding="latin-1") as f:
        f.write(pdf)
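Putting the pieces together, a rough sketch of the text-first approach
might look like the following. The image data is a made-up stand-in
(every possible byte value), the dictionary is trimmed from the example
above, and an in-memory buffer takes the place of a real file.

```python
import io

# Hypothetical stand-in for real image data: every possible byte value.
image_data = bytes(range(256))

# The textual part of the object, built with ordinary string formatting.
header = (
    "10 0 obj\n"
    "<< /Type /XObject\n"
    "/Width %d\n"
    "/Height %d\n"
    "/Length %d\n"
    ">>\n"
    "stream\n"
) % (100, 100, len(image_data))

# Work in text: decode the binary payload as Latin-1 and concatenate.
pdf = header + image_data.decode("latin-1") + "\nendstream\nendobj\n"

# Encode only on output.
buf = io.BytesIO()
buf.write(pdf.encode("latin-1"))
out = buf.getvalue()

# The binary payload survives the round trip unchanged.
assert image_data in out
```

The sketch writes in binary mode. With a text-mode file you would
probably also want to pass newline="" to open(), so that "\n" characters
coming from the image data are not translated to the platform's line
separator.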
There may be a few wrinkles I haven't thought of; I don't claim to be an
expert on PDF. But I see no reason why PDF files ought to be an
exception to the rule:
* work internally with Unicode text;
* convert to and from bytes only on input and output.
Please also take note that in Python 3.3 and later, the internal
representation of Unicode strings containing only code points up to 255
(i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte
per character.
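One way to observe this is to measure the marginal cost of one extra
character with sys.getsizeof. This is only a sketch: it relies on CPython
implementation details, and bytes_per_char is a helper invented for the
illustration.

```python
import sys

def bytes_per_char(ch):
    # Marginal cost, in bytes, of one more character in a string
    # made up only of `ch`.
    return sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)

ascii_cost = bytes_per_char("a")        # pure ASCII
latin1_cost = bytes_per_char("\xff")    # code point 255, still Latin-1
wider_cost = bytes_per_char("\u0100")   # code point 256, just past Latin-1

print(ascii_cost, latin1_cost, wider_cost)
```

On CPython 3.3 or later, the first two values should be 1 and the third
should be 2.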
Another advantage is that using text rather than bytes means that your
example:
[...]
> dropping the bytes-formatting of numbers makes it more complicated
> than it was. I would appreciate any explanation on how:
>
> b'%.1f %.1f %.1f RG' % (r, g, b)
becomes simply
    '%.1f %.1f %.1f RG' % (r, g, b)
in Python 3. In Python 3.3 and above, it can be written as:
    u'%.1f %.1f %.1f RG' % (r, g, b)
which conveniently is exactly the same syntax you would use in Python 2.
That's *much* nicer than your suggestion:
> is more confusing than:
>
> b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'),
> (r, g, b)))
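For comparison, a sketch of the text-first version with the encoding
step left to the very end. The colour values are made up for the example.

```python
r, g, b = 1.0, 0.5, 0.0  # hypothetical colour components

# Format as text, exactly as in Python 2...
op = '%.1f %.1f %.1f RG' % (r, g, b)

# ...and convert to bytes only when producing output.
data = op.encode('ascii')
assert data == b'1.0 0.5 0.0 RG'
```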
--
Steven