hex dump w/ or w/out utf-8 chars

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Jul 11 23:18:44 EDT 2013


On Thu, 11 Jul 2013 11:42:26 -0700, wxjmfauth wrote:

> And what to say about this "ucs4" char/string '\U0001d11e', which
> weighs 18 bytes more than an "a".
>
> >>> sys.getsizeof('\U0001d11e')
> 44
>
> A total absurdity.


You should stick to Python 3.1 and 3.2 then:

py> print(sys.version)
3.1.3 (r313:86834, Nov 28 2010, 11:28:10)
[GCC 4.4.5]
py> sys.getsizeof('\U0001d11e')
36
py> sys.getsizeof('a')
36


Now all your strings will be just as heavy: every single variable name 
and attribute name will use four times as much memory. Happy now?
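
For comparison, here is a minimal sketch of what the flexible string 
representation in Python 3.3+ (PEP 393), which I take to be the target 
of your complaint, does instead. The byte counts it prints vary by 
version and platform, so run it yourself rather than trust any one 
figure:

    import sys
    # PEP 393: a string is stored with 1, 2 or 4 bytes per code point,
    # chosen by the widest character it contains, so an all-ASCII
    # string no longer pays the 4-bytes-per-char price.
    for s in ('a', '\xe9', '\u20ac', '\U0001d11e'):
        print(repr(s), len(s), sys.getsizeof(s))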


> How does it come about? Very simple: once you split Unicode
> into subsets, not only do you have to handle these subsets, you have
> to create "markers" to differentiate them. Not only do you produce
> "markers", you have to handle the mess generated by these "markers".
> Hiding these markers in the overhead of the class does not mean that
> they should not be counted as part of the coding scheme. BTW, since
> when does a serious coding scheme need an external marker?

Since always.

How do you think that (say) a C compiler can tell the difference between 
the long 1199876496 and the float 67923.125? They both have exactly the 
same four bytes:

py> import struct
py> struct.pack('f', 67923.125)
b'\x90\xa9\x84G'
py> struct.pack('l', 1199876496)
b'\x90\xa9\x84G'


*Everything* in a computer is bytes. The only way to tell them apart is 
by external markers.
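
To see it the other way around, here is a quick sketch using the 
explicit '<f' and '<i' standard-size format codes (unlike the native 
'l' above, these behave the same on every platform): the very same 
four bytes decode to a float or an int depending only on which marker 
you supply:

py> import struct
py> raw = b'\x90\xa9\x84G'
py> struct.unpack('<f', raw)   # read as a little-endian 4-byte float
(67923.125,)
py> struct.unpack('<i', raw)   # read as a little-endian 4-byte int
(1199876496,)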



-- 
Steven


