Python usage numbers

Roy Smith roy at panix.com
Sun Feb 12 20:11:04 EST 2012


In article <mailman.5750.1329094801.27778.python-list at python.org>,
 Chris Angelico <rosuav at gmail.com> wrote:

> The advantage, though, is that you can always know how many bytes to
> read for X characters. In ASCII, you allocate 80 bytes of storage and
> you can store 80 characters. In UTF-8, if you want an 80-character
> buffer, you can probably get away with allocating 240 bytes...
> but maybe not. In UTF-32, it's easy - just allocate 320 bytes and you
> know you can store them. Also, you know exactly where the 17th
> character is; in UTF-8, you have to count. That's a huge advantage for
> in-memory strings; but is it useful on disk, where (as likely as not)
> you're actually looking for lines, which you still have to scan for?
> I'm thinking not, so it makes sense to use a smaller disk image than
> UTF-32 - fewer total bytes means fewer sectors to read/write, which
> translates fairly directly into performance.
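
For anyone who wants to see the indexing difference concretely, here is a
rough Python 3 sketch. The helper names nth_char_utf32 and nth_char_utf8 are
made up for illustration, and "character" here means a Unicode code point.
UTF-8 uses at most 4 bytes per code point, so 320 bytes is the safe bound
for 80 characters in either encoding.

```python
def nth_char_utf32(buf, n):
    """Return code point n (0-based) from UTF-32-LE bytes by direct offset."""
    if n < 0 or 4 * n + 4 > len(buf):
        raise IndexError(n)
    return buf[4 * n : 4 * n + 4].decode("utf-32-le")


def nth_char_utf8(buf, n):
    """Return code point n (0-based) from UTF-8 bytes by scanning from the start."""
    count = -1
    start = None
    for i, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a new code point starts here
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return buf[start:i].decode("utf-8")
    if start is not None:
        return buf[start:].decode("utf-8")  # the requested code point was the last one
    raise IndexError(n)


# Sample text mixing 1-, 2-, 3- and 4-byte UTF-8 sequences.
text = "h\u00e9llo w\u00f6rld \u2713 \U0001D11E end"
u8 = text.encode("utf-8")
u32 = text.encode("utf-32-le")  # explicit byte order, so no BOM is prepended

print(len(text), "code points:", len(u8), "bytes as UTF-8,", len(u32), "bytes as UTF-32")
print(repr(nth_char_utf32(u32, 16)), repr(nth_char_utf8(u8, 16)))
```

The UTF-32 lookup is a single slice; the UTF-8 lookup has to walk every byte
before the one it wants.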

You could just write the files compressed.  My guess is that a typical 
gzipped UTF-32 text file will be smaller than the same data stored as 
uncompressed UTF-8.
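
That guess is easy enough to test. A minimal Python 3 sketch follows; the
helper name encoded_sizes and the sample text are made up for illustration.
The sample is deliberately repetitive, so it compresses far better than real
prose would; substitute the contents of a real file before drawing any
conclusions.

```python
import gzip


def encoded_sizes(text):
    """Return byte counts for a few encodings of text (illustrative helper)."""
    utf8 = text.encode("utf-8")
    utf32 = text.encode("utf-32-le")  # explicit byte order, so no BOM is prepended
    return {
        "utf-8": len(utf8),
        "utf-32": len(utf32),
        "utf-8 gzipped": len(gzip.compress(utf8)),
        "utf-32 gzipped": len(gzip.compress(utf32)),
    }


# Placeholder data: one ASCII line repeated many times.
sample = "The quick brown fox jumps over the lazy dog.\n" * 2000
sizes = encoded_sizes(sample)
for name, size in sizes.items():
    print(f"{name:>15}: {size:>8} bytes")
```

To try it on real data, read a file with open(path, encoding="utf-8").read()
and pass the result to encoded_sizes.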


