[Python-Dev] Pre-PEP: The "bytes" object

Fri Feb 17 07:24:57 CET 2006

On Thu, 16 Feb 2006 12:47:22 -0800, Guido van Rossum <guido at python.org> wrote:

>On 2/15/06, Neil Schemenauer <nas at arctrix.com> wrote:
>> This could be a replacement for PEP 332.  At least I hope it can
>> serve to summarize the previous discussion and help focus on the
>> currently undecided issues.
>>
>> I'm too tired to dig up the rules for assigning it a PEP number.
>> Also, there are probably silly typos, etc.   Sorry.
>
>I may check it in for you, although right now it would be good if we
>had some more feedback.
>
>I noticed one behavior in your pseudo-code constructor that seems
>questionable: while in the Q&A section you explain why the encoding is
>ignored when the argument is a str instance, in fact you require an
>encoding (and one that's not "ascii") if the str instance contains any
>non-ASCII bytes. So bytes("\xff") would fail, but bytes("\xff",
>"blah") would succeed. I think that's a bit strange -- if you ignore
>the encoding, you should always ignore it. So IMO bytes("\xff") and
>bytes("\xff", "ascii") should both return the same as bytes([255]).
>Also, there's a code path where the initializer is a unicode instance
>and its encode() method is called with None as the argument. I think
>both could be fixed by setting the encoding to
>sys.getdefaultencoding() if it is None and the argument is a unicode
>instance:
>
>    def bytes(initialiser=[], encoding=None):
>        if isinstance(initialiser, basestring):
>            if isinstance(initialiser, unicode):
>                if encoding is None:
>                    encoding = sys.getdefaultencoding()
>                initialiser = initialiser.encode(encoding)
>            initialiser = [ord(c) for c in initialiser]
>        elif encoding is not None:
>            raise TypeError("explicit encoding invalid for non-string "
>                            "initialiser")
>        create bytes object and fill with integers from initialiser
>        return bytes object

Two things:
[1]--------

As the above shows, str is encoding-agnostic and passes through to bytes
unmodified (apart from the ord() conversion).

I am wondering what it would hurt to allow the same for unicode ords,
since unicode is also encoding-agnostic. Please read [2] before
deciding that you have already decided this ;-)

The beauty of a unicode literal IMO is that it launders away
the source encoding into a coding-agnostic character sequence
that has stable ords across the universe, so why not use them?
It also solves a lot of escaping grief. But see [2]

After all, in either case, an encoding can be specified if so desired. Thus

     def bytes(initialiser=[], encoding=None):
         if isinstance(initialiser, basestring):
             if encoding:
                 initialiser = initialiser.encode(encoding) # XXX for str ?? see [2]
             initialiser = [ord(c) for c in initialiser]
         elif encoding is not None:
             raise TypeError("explicit encoding invalid for non-string "
                             "initialiser")
         create bytes object and fill with integers from initialiser
         return bytes object
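
For anyone who wants to try the variant above, here is a minimal runnable
sketch. It is written for Python 3, where str is the unicode type, so it only
models the unicode half of the pseudo-code. The name make_bytes is hypothetical
(chosen so the builtin bytes is not shadowed), and relying on the builtin to
reject ordinals above 255 is an assumption about what "fill with integers"
should do.

```python
def make_bytes(initialiser=(), encoding=None):
    """Sketch of the proposed constructor; make_bytes is a hypothetical name."""
    if isinstance(initialiser, str):
        if encoding:
            # Explicit encoding: let the codec decide the byte values.
            return initialiser.encode(encoding)
        # No encoding: use the code points themselves as byte values.
        # The builtin raises ValueError for any ordinal above 255.
        return bytes(ord(c) for c in initialiser)
    if encoding is not None:
        raise TypeError("explicit encoding invalid for non-string initialiser")
    # Any other initialiser is taken as an iterable of integers.
    return bytes(initialiser)


print(make_bytes("abc\xf6"))            # code points used directly
print(make_bytes("abc\xf6", "utf-8"))   # codec decides the bytes
print(make_bytes([255]))
```

With no encoding the hex escape comes through as the byte value that was
written; with an encoding the codec is in charge.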

[2]-------

One thing I wonder is where sys.getdefaultencoding() gets its info, and whether
a module_encoding is also necessary for str arguments with encoding.

E.g. if the source encoding is utf-8 and you ultimately want sys.getdefaultencoding(),
don't you first have to decode from the source encoding, rather than let the
implicit decode default to ascii? E.g. for utf-8 source,

    initialiser.decode('utf-8').encode(sys.getdefaultencoding()) ?

works, but

    initialiser.encode(sys.getdefaultencoding())  ?

bombs, because it tries to do .decode('ascii') in place of .decode('utf-8')

Notice where the following fails (tutf.py writes the utf-8 source to tutf8.py,
and latin-1 is used as a stand-in for sys.getdefaultencoding())

----< tutf.py >-------------------------------------------
def test():
    latin_1_src = """\
# -*- coding: utf-8 -*-
print '\\nfrom tutf8 import:'
print map(hex,map(ord, 'abc\xf6'))
print map(hex,map(ord,'abc\xf6'.decode('utf-8').encode('latin-1')))
print map(hex,map(ord,repr('abc\xf6'.encode('latin-1'))))
"""
    open('tutf8.py','wb').write(latin_1_src.decode('latin-1').encode('utf-8'))

if __name__ == '__main__':
    test()
    print '\ntutf8.py utf-8 binary line reprs:'
    print '\n'.join(repr(L) for L in open('tutf8.py','rb').read().splitlines())
    import tutf8
----------------------------------------------------------
The result:

[20:17] C:\pywk\pydev\pep0332>py24 tutf.py

tutf8.py utf-8 binary line reprs:
'# -*- coding: utf-8 -*-'
"print '\\nfrom tutf8 import:'"
"print map(hex,map(ord, 'abc\xc3\xb6'))"
"print map(hex,map(ord,'abc\xc3\xb6'.decode('utf-8').encode('latin-1')))"
"print map(hex,map(ord,repr('abc\xc3\xb6'.encode('latin-1'))))"

from tutf8 import:
['0x61', '0x62', '0x63', '0xc3', '0xb6']
['0x61', '0x62', '0x63', '0xf6']
Traceback (most recent call last):
  File "tutf.py", line 15, in ?
    import tutf8
  File "C:\pywk\pydev\pep0332\tutf8.py", line 5, in ?
    print map(hex,map(ord,repr('abc+¦'.encode('latin-1'))))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

I.e., if you leave out encoding for a str, you apparently get the native source
str representation of the literal, so it would seem that that must be undone
if you want to re-encode to anything else.
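
The same effect can be sketched without writing a module to disk. This is a
Python 3 rendering, so the 8-bit str of the session above is modelled as a
bytes object and the implicit ascii decode that Python 2 performs is spelled
out explicitly; both are assumptions of the sketch, not part of the original
run.

```python
# The bytes a utf-8 source file holds for the literal 'abc' + o-umlaut.
raw = "abc\xf6".encode("utf-8")
print([hex(b) for b in raw])

# Decoding with the real source encoding first makes re-encoding safe.
recoded = raw.decode("utf-8").encode("latin-1")
print([hex(b) for b in recoded])

# Python 2 implicitly decoded a str as ascii before encoding it; doing that
# step explicitly here reproduces the failure at the first non-ascii byte.
error = None
try:
    raw.decode("ascii").encode("latin-1")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```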

Should there be tutf8.__encoding__ available for this after import tutf8?
But that's interesting when str becomes unicode, and all literals will presumably have
an internal uniform unicode encoding, so the 'literal'.decode(source_encoding) will in effect already
have been done. What does a decode mean on unicode? It seems to mean blow up on non-ascii, so
that's not very portable. Why not use latin-1 as the default intermediate str representation when
doing a u'something'.decode(enc) ? The restriction to ascii in that context seems artificial.
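
The property that makes latin-1 a candidate intermediate is that it maps every
byte value 0..255 to the code point with the same number, so nothing is lost or
rejected in either direction. A small check, written for Python 3 (bytes stands
in for the 8-bit str here, which is an assumption of the sketch):

```python
all_bytes = bytes(range(256))

# latin-1 decodes every byte value to the code point with the same number...
text = all_bytes.decode("latin-1")
same_ordinals = [ord(c) for c in text] == list(range(256))

# ...and encodes back to exactly the original bytes.
round_trips = text.encode("latin-1") == all_bytes

# ascii, by contrast, rejects everything from 0x80 upward.
try:
    all_bytes.decode("ascii")
    ascii_accepts_all = True
except UnicodeDecodeError:
    ascii_accepts_all = False

print(same_ordinals, round_trips, ascii_accepts_all)
```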

IMHO and with all due respect ISTM the pain of all these considerations is not worth it when
the simple practicality of just prefixing a "u" on any ascii literal freely sprinkled
with escapes gets you exactly the bytes values you specify in any hex escapes. That's normally
what you want.

If by 'abc\xf6' you really mean the character with ord value 0xf6 in some encoding, then
bytes('abc\xf6'.decode(someenc), destenc) would be the way, so no one is stuck.

One danger is that someone is writing in an incomplete source character set and
wants to stick in some byte values in hex, happily sticking to the ascii subset
plus escapes, but a decode from the source encoding can fail on a non-existent character
if the "ascii escape" is not in the source character set. E.g., cp1252 is pretty complete,
but

 >>> '\x81'.decode('cp1252')
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "d:\python-2.4b1\lib\encodings\cp1252.py", line 22, in decode
     return codecs.charmap_decode(input,errors,decoding_map)
 UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>

This can't happen with the same literal of ascii plus escapes passed as a unicode literal, given
that map(ord, literal) is done on it to get bytes when no encoding is specified. You just get what you expect.
It seems practical to me. I'm really trying to help, not piss you off ;-)
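
A side-by-side sketch of the two routes, in Python 3 terms (the bytes literal
plays the part of the escaped 8-bit str literal, and the str literal the part
of the u'' literal; both mappings are assumptions of the sketch):

```python
# Route 1: treat the escaped literal as text in the source encoding.
# cp1252 has no character at 0x81, so the decode fails.
try:
    b"abc\x81".decode("cp1252")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Route 2: take the ordinals of a unicode literal directly.
# Every escape up to \xff yields exactly the value that was written.
values = [ord(c) for c in "abc\x81"]

print(decoded_ok)
print([hex(v) for v in values])
```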

BTW, I recently posted re str.translate vs unicode.translate, which has some tie-in with this, since
I anticipate that bytes.translate would be a useful thing in the absence of str.translate.
unicode.translate won't do all one might like to do with bytes.translate, I believe. Both
have uses.
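
As a rough illustration of the difference, Python 3 did eventually ship both
flavours: bytes.translate takes a 256-entry table plus an optional set of bytes
to delete, while str.translate takes a mapping keyed by ordinal. The sketch
below shows that later API, not anything proposed in this thread.

```python
data = b"\x00abc\xff"

# bytes.translate: a 256-byte table built by bytes.maketrans.
table = bytes.maketrans(b"\x00\xff", b"\xff\x00")
swapped = data.translate(table)

# bytes.translate can also delete bytes without any remapping.
stripped = data.translate(None, b"\x00\xff")

# str.translate: a mapping from ordinal to ordinal (or string, or None).
text_swapped = "\x00abc\xff".translate({0x00: 0xFF, 0xFF: 0x00})

print(swapped, stripped, repr(text_swapped))
```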

>
>BTW, for folks who want to experiment, it's quite simple to create a
>working bytes implementation by inheriting from array.array. Here's a
>quick draft (which only takes str instance arguments):
>
>    from array import array
>    class bytes(array):
>        def __new__(cls, data=None):
>            b = array.__new__(cls, "B")
>            if data is not None:
>                b.fromstring(data)
>            return b
>        def __str__(self):
>            return self.tostring()
>        def __repr__(self):
>            return "bytes(%s)" % repr(list(self))
>        def __add__(self, other):
>            if isinstance(other, array):
>                return bytes(super(bytes, self).__add__(other))
>            return NotImplemented
>
Cool, thanks.

Regards,
Bengt Richter