Python 3.2 has some deadly infection

wxjmfauth at gmail.com wxjmfauth at gmail.com
Fri Jun 6 11:48:01 EDT 2014


On Friday, 6 June 2014 17:44:57 UTC+2, wxjm... at gmail.com wrote:
> On Friday, 6 June 2014 17:25:47 UTC+2, Chris Angelico wrote:
> > On Fri, Jun 6, 2014 at 11:24 PM, Ethan Furman <ethan at stoneleaf.us> wrote:
> > > On 06/05/2014 11:30 AM, Marko Rauhamaa wrote:
> > >>
> > >> How text is represented is very different from whether text is a
> > >> fundamental data type. A fundamental text file is such that ordinary
> > >> operating system facilities can't see inside the black box (that is,
> > >> they are *not* encoded as far as the applications go).
> > >
> > > Of course they are.  It may be an ASCII-encoding of some flavor or other, or
> > > something really (to me) strange -- but an encoding is most assuredly in
> > > effect.
> >
> > Allow me to explain what I think Marko's getting at here.
> >
> > In most file systems, a file exists on the disk as a set of sectors of
> > data, plus some metadata including the file's actual size. When you
> > ask the OS to read you that file, it goes to the disk, reads those
> > sectors, truncates the data to the real size, and gives you those
> > bytes.
> >
> > It's possible to mount a file as a directory, in which case the
> > physical representation is very different, but the file still appears
> > the same. In that case, the OS goes reading some part of the file,
> > maybe decompresses it, and gives it to you. Same difference. These
> > files still contain bytes.
> >
> > A "fundamental text file" would be one where, instead of reading and
> > writing bytes, you read and write Unicode text. Since the hard disk
> > still works with sectors and bytes, it'll still be stored as such, but
> > that's an implementation detail; and you could format your disk UTF-8
> > or UTF-16 or FSR or anything you like, and the only difference you'd
> > see is performance.
> >
> > This could certainly be done, in theory. I don't know how well it'd
> > fit with any of the popular OSes of today, but it could be done. And
> > these files would not have an encoding; their on-platter
> > representations would, but that's purely implementation - the text
> > that you wrote out and the text that you read in are the same text,
> > and there's been no encoding visible.
> >
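
The point quoted above (same text in and out, different bytes on disk) can
be sketched with ordinary files. This is only an illustration under my own
assumptions: round_trip is a hypothetical helper, not anything from the
thread, and it uses a temporary file rather than a "fundamental text file"
as described.

```python
import os
import tempfile

text = "Gödel"


def round_trip(s, encoding):
    """Hypothetical helper: write s as text, return (raw bytes on disk, text read back)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # newline="" disables newline translation so the bytes are exactly the encoded text.
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(s)
        with open(path, "rb") as f:
            raw = f.read()
        with open(path, "r", encoding=encoding, newline="") as f:
            back = f.read()
    finally:
        os.remove(path)
    return raw, back


utf8_raw, utf8_text = round_trip(text, "utf-8")
utf16_raw, utf16_text = round_trip(text, "utf-16-le")

# The text layer sees the same string either way; only the stored bytes differ.
print(utf8_text == utf16_text == text)
print(len(utf8_raw), len(utf16_raw))
```

At the text level both round trips give back the same string; the encoding
only shows up once the file is opened in binary mode.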
>
> ----------
>
> Of the three, you can already eliminate one.
> It's not good news.
>
> sys.getsizeof('Gödel'.encode('utf-8'))
> 23
> sys.getsizeof('Gödel'.encode('utf-16-le'))
> 27
> sys.getsizeof('Gödel')
> 42
> os.listdir(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
> ['a.txt', 'kk.bat', 'kk.cmd', 'kk.py', '__pycache__']
> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-8'))
> 61
> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-16-le'))
> 79
> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
> 100
>
> jmf

Sorry, wrong copy/paste

>>> import os, sys
>>> sys.getsizeof('Gödel'.encode('utf-8'))
23
>>> sys.getsizeof('Gödel'.encode('utf-16-le'))
27
>>> sys.getsizeof('Gödel')
42
>>> os.listdir(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
['a.txt', 'kk.bat', 'kk.cmd', 'kk.py', '__pycache__']
>>> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-8'))
61
>>> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-16-le'))
79
>>> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
100
>>>
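
Note that sys.getsizeof reports the whole object, header included, not only
the encoded payload. In the figures above the gap between getsizeof and the
payload length is 17 bytes for every bytes object (23-6, 27-10, 61-44,
79-62). A minimal sketch that separates the two, assuming CPython; report
is a hypothetical helper, and the overhead value depends on the build:

```python
import sys


def report(s):
    """Hypothetical helper: payload lengths versus sys.getsizeof for one string."""
    utf8 = s.encode("utf-8")
    utf16 = s.encode("utf-16-le")
    return {
        "code points": len(s),
        "utf-8 payload": len(utf8),
        "utf-16-le payload": len(utf16),
        # Fixed per-object cost of a bytes object on this interpreter build.
        "bytes object overhead": sys.getsizeof(utf8) - len(utf8),
        # Size of the str object itself, header and internal buffer included.
        "str getsizeof": sys.getsizeof(s),
    }


short = report("Gödel")
path = report(r"D:\jm\Москва\Zürich\Αθήνα\œdipe")

for name, result in (("short", short), ("path", path)):
    print(name, result)
```

The payload lengths are the same on any platform; the overhead and the str
figure are implementation details of the interpreter.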

jmf


