How to waste computer memory?

BartC bc at freeuk.com
Sat Mar 19 11:14:08 EDT 2016


On 19/03/2016 14:18, Steven D'Aprano wrote:
> On Sat, 19 Mar 2016 11:24 pm, BartC wrote about combining characters:

>> And occupy somewhere between 50 and 200 bytes? Or is that 400?

>> OK...
>
> You say that as if 400 bytes was a lot.

No, just unpredictable.

> Besides, this is hardly any different from (say) a pure ASCII version of
> the "permille" (per thousand) symbol. In Unicode I can write ‰ (two bytes
> in UTF-16) but in ASCII I am forced to write O/oo (four bytes), or
> worse, "per thousand" (12 bytes). Imagine a string of "‰"*50, written in
> ASCII, for a total of 600 bytes...
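(Those byte counts are easy to check; a quick Python 3 sketch, with the
exact numbers depending on which encoding you pick:

```python
s = "\u2030" * 50  # fifty PER MILLE SIGN characters

# In UTF-16 each one is two bytes; in UTF-8 it is three.
print(len(s.encode("utf-16-le")))  # 100
print(len(s.encode("utf-8")))      # 150

# The ASCII work-around "O/oo" costs four bytes per symbol.
print(len("O/oo" * 50))            # 200
```

So the "600 bytes" figure is the 12-byte "per thousand" spelling times 50.)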

Those kinds of problems are well known with ASCII, for example needing 
to compare strings but ignoring case, or treating tabs as spaces. It's 
clear that dealing with those properly goes beyond the remit of basic 
string processing in a language.

With Unicode there are a whole bunch of other problems, and some people 
expect basic string handling to be able to deal with all of them. (I 
think Unicode should be dealt with at the next level up. Then some of 
us can stay at the bottom level that is more efficient and works 99% of 
the time on average, and just about 100% for most.)
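(The unpredictability is easy to demonstrate: the same fifty n-tildes can be 
stored precomposed or as a base letter plus a combining tilde, and the 
lengths and byte counts differ either way. A Python 3 sketch:

```python
import unicodedata

precomposed = "\u00f1" * 50  # fifty precomposed n-tilde characters
# NFD splits each into "n" plus U+0303 COMBINING TILDE:
decomposed = unicodedata.normalize("NFD", precomposed)

print(len(precomposed), len(decomposed))    # 50 vs 100 code points
print(len(precomposed.encode("utf-8")),
      len(decomposed.encode("utf-8")))      # 100 vs 150 bytes

# And they don't even compare equal without normalizing first:
print(precomposed == decomposed)            # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Which is exactly the sort of thing that, in my view, belongs a level above 
basic string handling.)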

> Yes, this is silly. Really, if you've got 50 ñ in a string, they take up the
> space they take up, and memory is cheap.

Which is about 3000 decimal digits, slightly more than 1KB in packed 
binary. In BCD it would be 1.5KB. At one byte per digit (e.g. ASCII) it's 
3KB. At four bytes per digit (e.g. UCS-4), it's 12KB.

What would you say to someone advocating 12 times as much storage for 
long integers as is used now? After all, memory is cheap!
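(The arithmetic behind those figures, as a sketch: each decimal digit 
carries log2(10), roughly 3.32, bits of information, so for 3000 digits:

```python
import math

digits = 3000
packed = math.ceil(digits * math.log2(10) / 8)  # packed binary: ~1246 bytes
bcd = digits // 2                               # BCD, 4 bits/digit: 1500 bytes
one_byte = digits                               # ASCII, 1 byte/digit: 3000 bytes
ucs4 = digits * 4                               # UCS-4, 4 bytes/digit: 12000 bytes

print(packed, bcd, one_byte, ucs4)  # 1246 1500 3000 12000
```

Hence the factor of roughly 12 between packed binary and UCS-4.)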

> and my computer calculates and prints the result faster than I can enter
> the calculation in the first place. Worrying about the fact that characters
> use more than 8 bits is oh-so-1990s.

We still need to worry about it. Whatever memory is being used up (RAM, 
cache, flash, disk), 16-bit characters will use twice as much as 8-bit 
ones, and 32-bit characters four times as much. And the bandwidth needed 
to access or transmit them grows by the same factor.

But the existence of UTF-8 means something /has/ been done about it, or 
some of it; somebody /has/ worried about it.
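(For mostly-ASCII text, which is what source code and markup largely are, 
those factors are exactly what the fixed-width encodings pay, while UTF-8 
pays nothing. A Python 3 sketch on an arbitrary ASCII-only string:

```python
text = "for i in range(10): print(i)\n" * 100  # typical ASCII-only source text

utf8 = len(text.encode("utf-8"))       # 1 byte per ASCII character
utf16 = len(text.encode("utf-16-le"))  # 2 bytes per character
utf32 = len(text.encode("utf-32-le"))  # 4 bytes per character

print(utf16 // utf8, utf32 // utf8)    # factors of 2 and 4
```

That factor of one is why UTF-8 counts as somebody having worried about it.)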


> The days of thinking that 127 characters is all you need (7 bit ASCII)
> are long, long gone, just like the days when it was appropriate for
> ints to be 16 bits.

Some things haven't actually changed that much.

Word sizes might have doubled from 32 bits on a mainframe to 64 bits now 
(temporarily reducing to 8 and 16 along the way for micros and minis).

But the English alphabet still has 26 letters. Keyboards still have 
around 100 keys. And programming languages and text formats still 
predominantly use the ASCII subset for their keywords and identifiers.


-- 
Bartc


