[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Sun Jan 12 05:38:49 CET 2014

On 01/11/2014 06:29 PM, Steven D'Aprano wrote:
> On Sat, Jan 11, 2014 at 11:05:36AM -0800, Ethan Furman wrote:
>> On 01/11/2014 10:36 AM, Steven D'Aprano wrote:
>>> On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
>>>>
>>>>    unicode to bytes
>>>>    bytes to unicode using latin1
>>>>    unicode to bytes
>>>
>>> Where do you get this from? I don't follow your logic. Start with a text
>>> template:
>>>
>>> template = """\xDE\xAD\xBE\xEF
>>> Name:\0\0\0%s
>>> Age:\0\0\0\0%d
>>> Data:\0\0\0%s
>>> blah blah blah
>>> """
>>>
>>> data = template % ("George", 42, blob.decode('latin-1'))
>
> Since the use-cases people have been speaking about include only ASCII
> (or at most, Latin-1) text and arbitrary binary bytes, my example is
> limited to showing only ASCII text. But it will work with any text data,
> so long as you have a well-defined format that lets you tell which parts
> are interpreted as text and which parts as binary data.

Since you're talking to me, it would be nice if you addressed the same use case I was addressing, which is mixed:
ASCII-encoded text, ASCII-encoded numbers, ASCII-encoded bools, binary-encoded numbers, and misc-encoded text.

And no, your example will not work with any text; it would completely mojibake my dbf files.
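
A minimal Python 3 sketch of the problem, assuming a file whose declared codec is cp437 (the codec and the sample strings are illustrative assumptions, not taken from a real dbf file). Text that happens to fit in Latin-1 is written with the wrong byte values, and text outside Latin-1 cannot be written at all:

```python
# Assumption: the file's metadata says user text is stored as cp437.
text = "\u00e9"                      # 'é', which Latin-1 can represent

wrong = text.encode("latin-1")       # what a blanket latin-1 encode writes
right = text.encode("cp437")         # what the file format actually requires

# A reader that honours the file's codec sees a different character.
read_back = wrong.decode("cp437")

# Text outside Latin-1 does not survive the final encode at all.
try:
    "\u0441\u0440\u0403".encode("latin-1")
    failed = False
except UnicodeEncodeError:
    failed = True

print(wrong, right, read_back, failed)
```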

>>> Only the binary blobs need to be decoded. We don't need to encode the
>>> template to bytes, and the textual data doesn't get encoded until we're
>>> ready to send it across the wire or write it to disk.

No!  When I have text, part of which gets ASCII-encoded and part of which gets, say, cp1251-encoded, I cannot wait
until the end!

>> And what if your name field has data not representable in latin-1?
>>
>> --> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
>> u'\u0441\u0440\u0403'
>
> Where did you get those bytes from? You got them from somewhere.

For the sake of argument, pretend a user entered them in.

> Who knows? Who cares? Once you have bytes, you can treat them as a blob of
> arbitrary bytes and write them to the record using the Latin-1 trick.

No, I can't.  See above.

>  If
> you're reading those bytes from some stream that gives you bytes, you
> don't have to care where they came from.

You're kidding, right?  If I don't know where they came from (a graphics field?  a note field?), how am I going to
know how to treat them?

> But what if you don't start with bytes? If you start with a bunch of
> floats, you'll probably convert them to bytes using the struct module.

Yup, and I do.

> If you start with non-ASCII text, you have to convert them to bytes too.
> No difference here.

Really?  You just said above that "it will work with any text data" -- you can't have it both ways.

> You ask the user for their name, they answer "срЃ" which is given to you
> as a Unicode string, and you want to include it in your data record. The
> specifications of your file format aren't clear, so I'm going to assume
> that:
>
> 1) ASCII text is allowed "as-is" (that is, the name "George" will be
>     in the final data file as b'George');

User data is not (typically) where the ASCII data is, but some of the metadata is definitely and always ASCII.  The user 
text data needs to be encoded using whichever codec is specified by the file, which is only occasionally ASCII.

> 2) any other non-ASCII text will be encoded as some fixed encoding
>     which we can choose to suit ourselves;

Well, the user chooses it; we have to abide by their choice.  (It's kept in the file metadata.)
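
As a rough Python 3 sketch of that split (the marker values, the `CODECS` table, and the `encode_field` helper are all hypothetical, made up for illustration), metadata such as a field name is always ASCII, while the value uses whatever codec the file declares:

```python
# Hypothetical mapping from a codec marker stored in the file header
# to a Python codec name; the marker values are invented for this sketch.
CODECS = {1: "cp437", 2: "cp1251", 3: "ascii"}

def encode_field(name, value, codec_marker):
    """Encode a field name as ASCII and its value with the file's codec."""
    codec = CODECS[codec_marker]
    return name.encode("ascii") + b"\x00" + value.encode(codec)

# The same text yields different bytes depending on the file's metadata.
cyrillic = encode_field("NAME", "\u0441\u0440\u0403", 2)
plain = encode_field("NAME", "George", 3)
print(cyrillic, plain)
```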

> 3) arbitrary binary data is allowed "as-is" (i.e. byte N has to end up
>     being written as byte N, for any value of N between 0 and 255).

In a couple of field types, yes.  Usually the binary data is numeric or date-related and there is conversion going on
there, too, to give me the bytes I need.
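
For the numeric case, a small Python 3 sketch using the `struct` module; the field layouts (a little-endian 32-bit integer and a little-endian 64-bit float) are assumptions chosen for illustration, not a description of any particular dbf field type:

```python
import struct

# Assumed layouts: '<i' is a little-endian 32-bit signed integer,
# '<d' is a little-endian 64-bit IEEE 754 float.
age_bytes = struct.pack("<i", 42)
ratio_bytes = struct.pack("<d", 1.5)

# Reading the value back requires knowing the same layout.
(age,) = struct.unpack("<i", age_bytes)
print(age_bytes, ratio_bytes, age)
```

These bytes are already final: there is no text form of them to carry around until write time.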

[snip]

>> --> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
>> Traceback (most recent call last):
>>    File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'latin-1' codec can't encode characters in position
>> 0-2: ordinal not in range(256)
>
> That is backwards to what I've shown. Look at my earlier example again:

And you are not paying attention:

'\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
\--------------------------------------/  \-------------/
  a non-ascii compatible unicode string      to latin1 bytes

("срЃ".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'), 42, blob.decode('latin-1'))
       \----------------------------------------------/  \--------------/
                getting the actual bytes I need            and back into unicode until I write them later

You did say to use a *text* template to manipulate my data, and then write it later, no?  Well, this is what it would 
look like.

> Bytes get DECODED to latin-1, not encoded.
>
> Bytes -> text is *decoding*
> Text -> bytes is *encoding*

Pretend for a moment I know that, and look at my examples again.

I am demonstrating the contortions needed when my TEXTual data is not ASCII-compatible:  It must be ENcoded using the 
appropriate codec to BYTES, then DEcoded back to unicode using latin1, all so later I can ENcode the bloomin' unicode 
data structure back to bytes using latin1 again.  Dizzy yet?
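
Spelled out as a Python 3 sketch, with cp1251 assumed as the file's codec and a shortened, hypothetical version of the text template:

```python
# Hypothetical, shortened text template in the style of the latin-1 approach.
template = "Name:\x00\x00\x00%s\nAge:\x00\x00\x00\x00%d\n"

name = "\u0441\u0440\u0403"          # the user's text

# Step 1: ENcode the text with the codec the file requires (cp1251 assumed).
name_bytes = name.encode("cp1251")

# Step 2: DEcode those bytes as latin-1 so they can sit in a text template.
name_smuggled = name_bytes.decode("latin-1")

# Step 3: interpolate, then ENcode the whole record as latin-1 to write it.
record = (template % (name_smuggled, 42)).encode("latin-1")
print(record)
```

The intermediate `name_smuggled` string is not meaningful text; it only exists to carry the cp1251 bytes through the template.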

And you must know this, because it is what your bytify function does.  Are you trolling?

--
~Ethan~