[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Stephen J. Turnbull stephen at xemacs.org
Sun Jan 12 23:31:16 CET 2014


Steven D'Aprano writes:

 > then the name is horribly misleading, and it is best handled like this:
 > 
 >     content = '\n'.join([
 >         'header',
 >         'part 2 %.3f' % number,
 >         binary_image_data.decode('latin-1'),
 >         utf16_string,  # Misleading name, actually Unicode string
 >         'trailer'])

This loses bigtime, as any output encoding that can handle non-Latin-1
characters in utf16_string will corrupt binary_image_data.  OTOH,
encoding with latin1 will raise on non-Latin-1 characters.  So
utf16_string must first be encoded appropriately to bytes, then decoded
with latin1, so that it can be re-encoded with latin1 on output.
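
A minimal sketch of that failure and of the workaround.  The sample
values are made up for illustration, and UTF-16 is only assumed to be
the "appropriate" encoding because of the variable's name:

```python
# Hypothetical stand-ins for the names in the quoted example.
binary_image_data = bytes([0x89, 0x50, 0xff, 0x00])  # arbitrary bytes, some > 127
utf16_string = 'caf\u00e9 \u20ac'  # a str; the euro sign is outside Latin-1

content = '\n'.join([
    'header',
    binary_image_data.decode('latin-1'),
    utf16_string,
    'trailer'])

# Latin-1 cannot encode the euro sign, so encoding on output raises.
try:
    content.encode('latin-1')
    latin1_raised = False
except UnicodeEncodeError:
    latin1_raised = True

# Workaround: encode the text to bytes first, then smuggle those bytes
# through Latin-1 exactly like the image data.
smuggled = '\n'.join([
    'header',
    binary_image_data.decode('latin-1'),
    utf16_string.encode('utf-16').decode('latin-1'),
    'trailer'])
wire = smuggled.encode('latin-1')

# Both payloads come out byte-for-byte intact.
image_survived = binary_image_data in wire
text_survived = utf16_string.encode('utf-16') in wire
```

Note that after the workaround the text is no longer usable as text
inside content; it is opaque bytes in disguise.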

 > On the other hand, if it is actually a bytes object which is the product 
 > of UTF-16 encoding, i.e.:
 > 
 > type(utf16_string)
 > => returns bytes
 > 
 > and those bytes were generated by "some text".encode("utf-16"), then it 
 > is already binary data and needs to be smuggled into the text string. 
 > Latin-1 is good for that:
 > 
 >     content = '\n'.join([
 >         'header',
 >         'part 2 %.3f' % number,
 >         binary_image_data.decode('latin-1'),
 >         utf16_string.decode('latin-1'),
 >         'trailer'])
 > 
 > 
 > Both examples assume that you intend to do further processing of content 
 > before sending it, and will encode just before sending:
 > 
 >     content.encode('utf-8')
 > 
 > (Don't use Latin-1, since it cannot handle the full range of text 
 > characters.)

This corrupts binary_image_data.  When the result is encoded as UTF-8,
each byte > 127 will be replaced by two bytes.  In the second case, you
can use latin1 to encode, if it gives you what you want.
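
To see the doubling concretely, here is a small sketch with a made-up
three-byte payload standing in for binary_image_data:

```python
# Hypothetical sample: one ASCII byte followed by two bytes > 127.
binary_image_data = bytes([0x41, 0x80, 0xff])

# Smuggle the bytes into a str, as in the quoted examples.
smuggled = binary_image_data.decode('latin-1')

# Encoding as UTF-8 turns each byte > 127 into a two-byte sequence.
as_utf8 = smuggled.encode('utf-8')

# Encoding as Latin-1 restores the original bytes exactly.
as_latin1 = smuggled.encode('latin-1')

print(len(binary_image_data), len(as_utf8))  # 3 5
print(as_utf8)                               # b'A\xc2\x80\xc3\xbf'
print(as_latin1 == binary_image_data)        # True
```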

This kind of subtlety is precisely why MAL warned about use of latin1
to smuggle bytes.


