How to waste computer memory?

Steven D'Aprano steve at pearwood.info
Fri Mar 18 12:44:33 EDT 2016


On Sat, 19 Mar 2016 02:31 am, Random832 wrote:

> On Fri, Mar 18, 2016, at 11:17, Ian Kelly wrote:
>> > Just to play devil's advocate, here, why is it so bad for indexing to
>> > be O(n)? Some simple caching is all that's needed to prevent it from
>> > making iteration O(n^2), if that's what you're worried about.
>> 
>> What kind of caching do you have in mind?
> 
> The byte index of the last character index accessed. 

Some people, when faced with a problem, think, I know, I'll use a cache. Now
they have 99 problems (but a bitch ain't one).


> When accessing a 
> new index greater than that one, start from there (_maybe_ also support
> iterating backwards in this way, if accessing an index that is much
> closer to the cached one than to zero). And even that's only necessary
> if you actually _care_ about forward iteration by character indices
> (i.e. "for i in range(len(s))"), rather than writing it off as bad
> coding style.

Without locking, this kills thread safety for strings. With locking, it
probably kills performance. Worse, even in single-threaded code, functions
can mess up the cache, destroying any usefulness it may have:


x = mystring[9999]  # sets the cache to index 9999
spam(mystring)      # calls mystring[0], setting the cache back to 0
y = mystring[10000] # cache is useless; must scan from the start again
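
Random832's scheme is easy enough to sketch. Here is a minimal,
hypothetical Python 3 version (the class and every name in it are my
invention, not anyone's actual proposal): the string is stored as UTF-8
bytes, and we cache the byte offset of the last character index accessed:

class CachedUTF8String:
    def __init__(self, s):
        self._buf = s.encode('utf-8')
        self._cache = (0, 0)  # (character index, byte offset) of last access

    @staticmethod
    def _seqlen(lead):
        # The lead byte's range gives the sequence length:
        # 0xxxxxxx = 1 byte, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4.
        return 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4

    def __getitem__(self, i):
        ci, bi = self._cache
        if i < ci:       # no cheap backwards scan, so restart from byte 0
            ci, bi = 0, 0
        while ci < i:    # skip forward one code point at a time: O(n)
            bi += self._seqlen(self._buf[bi])
            ci += 1
        self._cache = (ci, bi)
        return self._buf[bi:bi + self._seqlen(self._buf[bi])].decode('utf-8')

Accessing s[9999] and then s[10000] is cheap, but any intervening access
to a low index, as in the spam() example above, resets the cache and
forces a scan from the start.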


And I don't understand this meme that indexing strings is not important.
Have people never (say) taken a slice of a string, or done a one-character
look-ahead, or something similar?

i = mystring.find(":")      # find() returns -1 if ":" is absent
next_char = mystring[i+1]   # one-character look-ahead past the colon

# Strip the first and last chars from a string
mystring[1:-1]


Perhaps you might argue that for some applications, O(N) string operations
don't matter. After all, Linux terminals are usually set to UTF-8, and
people can copy and paste substrings out of the buffer, etc. So there may
be a case for saying that for applications with line-oriented buffers
typically holding 70 or 170 characters per line, like a terminal, UTF-8 is
perfectly adequate. I have no argument against that.

But for a programming language which may be dealing with strings multiple
tens of megabytes in size, I'm skeptical that UTF-8 doesn't cause a
performance hit for at least some operations.
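
For what it's worth, the asymmetry is easy to measure. A rough sketch
(Python 3; utf8_index is my own helper, not any real API): indexing into
Python's str is constant time, while locating code point i in raw UTF-8
means scanning from the front:

import timeit

big = '\u03b1' * 1000000   # a million alphas, each 2 bytes in UTF-8
raw = big.encode('utf-8')

def utf8_index(buf, i):
    # Return code point i of the UTF-8 bytes buf by scanning from byte 0.
    bi = 0
    for _ in range(i):
        lead = buf[bi]
        bi += 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    lead = buf[bi]
    n = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return buf[bi:bi + n].decode('utf-8')

print(timeit.timeit(lambda: big[999999], number=10))              # O(1): fast
print(timeit.timeit(lambda: utf8_index(raw, 999999), number=10))  # O(n): slow

Caching the last offset only rescues access patterns that march forwards.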


>> It's not the only drawback, either. If you want to know anything about
>> the characters in the string that you're looking at, you need to know
>> their codepoints.
> 
> Nonsense. That depends on what you want to know about it. You can
> extract a single character from a string, as a string, without knowing
> anything about it except what range the first byte is in. You can use
> this string directly as an index to a hash table containing information
> such as unicode properties, names, etc.

I don't understand your comment. If I give you the index of a character,
how do you know where its first byte is? With UTF-8, the start of
character i can be anywhere between byte i and byte 4*i.

(And by character I actually mean code point -- let's not get into arguments
about normalisation.)
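
To make that spread concrete, a tiny Python 3 illustration (nothing here
beyond the standard codecs):

ascii_text = 'a' * 100
emoji_text = '\U0001F600' * 100   # each code point takes 4 bytes in UTF-8

print(len(ascii_text.encode('utf-8')))  # 100: code point 50 starts at byte 50
print(len(emoji_text.encode('utf-8')))  # 400: code point 50 starts at byte 200

You can't know which case you're in, or anything in between, without
scanning the bytes.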



>> If the string is simple UCS-2, that's easy. 

Hmmm, well, nobody uses UCS-2 any more, since that only covers the first
65536 code points. Rather, languages like JavaScript and Java, and the
Windows OS, use UTF-16, which is a *variable width* extension of UCS-2. I
don't know about Windows, but JavaScript handles this badly, so that code
points outside the Basic Multilingual Plane (which need four bytes in
UTF-16) are exposed as *two* surrogates instead of the single code point
they are meant to be. We can get the same result in Python 2.7 narrow
builds:

py> s = u'\U0010FFFF'  # definitely a single code point
py> len(s)  # WTF?
2
py> s[0]  # a surrogate.
u'\udbff'
py> s[1]  # another surrogate
u'\udfff'

The only fixed-width encoding of the entire Unicode character set is UTF-32.
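
For comparison, here is what a wide Python 2 build (or any Python 3.3+)
reports for the same code point; the encode calls are just the standard
codecs:

py> s = u'\U0010FFFF'
py> len(s)                      # one code point, one "character"
1
py> len(s.encode('utf-8'))      # one to four bytes per code point
4
py> len(s.encode('utf-16-le'))  # a surrogate pair: two 16-bit code units
4
py> len(s.encode('utf-32-le'))  # always exactly four bytes per code point
4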




-- 
Steven



