[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Steven D'Aprano
steve at pearwood.info
Sun Jan 12 18:22:21 CET 2014
On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
> On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano <steve at pearwood.info> wrote:
>
> > On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
> >
> > > AFAIK (and just for the record), there could be both Latin1 text and
> > > UTF-16 in a PDF (and other encodings too), depending on the font used:
> > [...]
> > > In Python2, txt is just a str, but in Python3 handling everything as
> > > latin1 string obviously doesn't work for TTF in this case.
> >
> > Nobody is suggesting that you use Latin-1 for *everything*. We're
> > suggesting that you use it for blobs of binary data that represent
> > arbitrary bytes. First you have to get your binary data in the first
> > place, using whatever technique is necessary.
>
>
> Just to check I understood what you are saying. Instead of writing:
>
>     content = b'\n'.join([
>         b'header',
>         b'part 2 %.3f' % number,
>         binary_image_data,
>         utf16_string.encode('utf-16be'),
>         b'trailer'])
Which doesn't work, since bytes don't support %f in Python 3.
> it should now look like:
>
>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binary_image_data.decode('latin-1'),
>         utf16_string.encode('utf-16be').decode('latin-1'),
>         'trailer']).encode('latin-1')
>
> Correct?
Not quite as you show.
First, "utf16_string" confuses me. What is it? If it is a Unicode
string, i.e.:
# Python 3 semantics
type(utf16_string)
=> returns str
then the name is horribly misleading, and it is best handled like this:
    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string,  # Misleading name, actually Unicode string
        'trailer'])
Note that since it's text, and content is text, there is no need to
encode then decode.
"UTF-16" is not another name for "Unicode". Unicode is a character set.
UTF-16 is just one of a number of different encodings which map the
0x110000 possible Unicode code points (U+0000 through U+10FFFF) to bytes.
UTF-16 is one possible way to implement Unicode strings in memory, but
not the only way. Python has used, or does use, several distinct
implementations:
1) UTF-16 in "narrow builds"
2) UTF-32 in "wide builds"
3) a hybrid approach starting in Python 3.3, where strings are
stored as either:
3a) Latin-1
3b) UCS-2
3c) UTF-32
depending on the content of the string.
So calling an arbitrary string "utf16_string" is misleading or wrong.
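A small sketch of the distinction, using a made-up two-character sample
string (the variable names here are only for illustration): the same
Unicode text comes out as different bytes depending on which encoding
you pick, so the text itself is not "UTF-16" or anything else until you
encode it.

```python
# One str, several encodings: the code points are identical,
# the resulting bytes are not.
text = 'h\u00e9'  # 'h' followed by U+00E9 (e with acute accent)

encoded = {
    'latin-1': text.encode('latin-1'),
    'utf-8': text.encode('utf-8'),
    'utf-16-be': text.encode('utf-16-be'),
    'utf-32-be': text.encode('utf-32-be'),
}

for name, data in encoded.items():
    # Each encoding decodes back to the very same str.
    assert data.decode(name) == text
    print(name, len(data), data)
```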
On the other hand, if it is actually a bytes object which is the product
of UTF-16 encoding, i.e.:
type(utf16_string)
=> returns bytes
and those bytes were generated by "some text".encode("utf-16"), then it
is already binary data and needs to be smuggled into the text string.
Latin-1 is good for that:
    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string.decode('latin-1'),
        'trailer'])
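Why Latin-1 in particular? A minimal check, assuming nothing beyond the
standard codecs: Latin-1 maps each of the 256 possible byte values to
the code point with the same number, so decoding and re-encoding with
Latin-1 gets you back exactly the bytes you started with. The names
below are just for illustration.

```python
# Every possible byte value, 0x00 through 0xFF.
every_byte = bytes(range(256))

# Decode as Latin-1: each byte becomes the code point with the same number.
smuggled = every_byte.decode('latin-1')
assert [ord(c) for c in smuggled] == list(range(256))

# Encode as Latin-1 again: the original bytes come back unchanged.
restored = smuggled.encode('latin-1')
assert restored == every_byte

# The round trip depends on using the same codec on the way out.
# UTF-8 needs two bytes for each code point from U+0080 to U+00FF.
as_utf8 = smuggled.encode('utf-8')
print(len(restored), len(as_utf8))
```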
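Both examples assume that you intend to do further processing of content
before sending it, and will encode just before sending:
    content.encode('utf-8')
(Latin-1 cannot handle the full range of text characters. But beware:
bytes smuggled in via Latin-1 only come back out unchanged if you encode
with Latin-1 again. UTF-8 will turn every byte above 0x7F into two
bytes.)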
If that's not the case, then perhaps this is better suited to what you
are doing:
    content = b'\n'.join([
        b'header',
        ('part 2 %.3f' % number).encode('ascii'),
        binary_image_data,  # already bytes
        utf16_string,  # already bytes
        b'trailer'])
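For concreteness, here is the same pattern as a runnable sketch. The
sample values (number, the six "image" bytes, the encoded string) are
invented placeholders, not anything from a real PDF.

```python
# Placeholder inputs, made up for illustration only.
number = 3.14159
binary_image_data = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00])
utf16_bytes = 'caf\u00e9'.encode('utf-16-be')  # bytes, not str

content = b'\n'.join([
    b'header',
    # Format as text first, then encode; the result is pure ASCII.
    ('part 2 %.3f' % number).encode('ascii'),
    binary_image_data,  # already bytes, passed through untouched
    utf16_bytes,        # already bytes, passed through untouched
    b'trailer',
])

print(content)
```

Everything stays bytes from start to finish, so there is no final
encode step and no chance of the binary parts being altered.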
--
Steven