Python usage numbers

Chris Angelico rosuav at gmail.com
Sun Feb 12 19:59:57 EST 2012


On Mon, Feb 13, 2012 at 11:03 AM, Dave Angel <d at davea.name> wrote:
> On 02/12/2012 06:29 PM, Steven D'Aprano wrote:
>> I think you mean 4 times as many bytes as characters. Unless you have 32
>> bit bytes :)
>>
>>
> Until you have 32 bit bytes, you'll continue to have encodings, even if only
> a couple of them.

The advantage of UTF-32, though, is that you always know how many
bytes to read for X characters. In ASCII, you allocate 80 bytes of
storage and you can store 80 characters. In UTF-8, if you want an
80-character buffer, you can probably get away with allocating 240
bytes (enough for anything in the Basic Multilingual Plane)... but
maybe not, since characters outside it take 4 bytes each. In UTF-32,
it's easy - just allocate 320 bytes and you know you can store them.
Also, you know exactly where the 17th character is; in UTF-8, you have
to count. That's a huge advantage for in-memory strings; but is it
useful on disk, where (as likely as not) you're actually looking for
lines, which you still have to scan for? I'm thinking not, so it makes
sense to use a more compact on-disk encoding than UTF-32 - fewer total
bytes means fewer sectors to read/write, which translates fairly
directly into performance.
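
To make the sizes and the "you have to count" point concrete, here is a
rough sketch, assuming Python 3 (where iterating over bytes yields
ints). The helper names utf32_offset and utf8_offset are made up for
illustration and are not part of any library; "utf-32-le" is used so
that no BOM is prepended.

```python
def utf32_offset(index):
    """Byte offset of character `index` in BOM-less UTF-32: pure arithmetic."""
    return 4 * index


def utf8_offset(data, index):
    """Byte offset of character `index` in UTF-8 `data`, found by scanning."""
    seen = -1
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new character.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                return pos
    raise IndexError(index)


# Storage needed for 80 characters, depending on what they are.
ascii_80 = "a" * 80            # 1 byte each in UTF-8
euro_80 = "\u20ac" * 80        # 3 bytes each in UTF-8 (still in the BMP)
snake_80 = "\U0001F40D" * 80   # 4 bytes each in UTF-8 (outside the BMP)

utf8_sizes = [len(s.encode("utf-8")) for s in (ascii_80, euro_80, snake_80)]
utf32_sizes = [len(s.encode("utf-32-le")) for s in (ascii_80, euro_80, snake_80)]

# Finding the 4th character (index 3) of a mixed-width string.
sample = "a\u00e9\u20ac\U0001F40D"   # widths in UTF-8: 1, 2, 3, 4 bytes
sample_utf8 = sample.encode("utf-8")
sample_utf32 = sample.encode("utf-32-le")

start8 = utf8_offset(sample_utf8, 3)
start32 = utf32_offset(3)
```

Here utf8_sizes comes out as [80, 240, 320] while utf32_sizes is
[320, 320, 320], so the 240-byte guess only holds as long as nothing
outside the BMP turns up.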

ChrisA


