How to waste computer memory?

Fri Mar 18 13:06:41 EDT 2016

On Fri, Mar 18, 2016, at 12:44, Steven D'Aprano wrote:
> And I don't understand this meme that indexing strings is not important.
> Have people never (say) taken a slice of a string, or a look-ahead, or
> something similar?
> 
> i = mystring.find(":")

find is already O(N).

> next_char = mystring[i+1]
> 
> # Strip the first and last chars from a string 
> mystring[1:-1]

Slicing is already O(N) in the size of the slice, so adding a scan that
is O(N) in your indices (which are 1 here) isn't a significant addition.

> >> It's not the only drawback, either. If you want to know anything about
> >> the characters in the string that you're looking at, you need to know
> >> their codepoints.
> > 
> > Nonsense. That depends on what you want to know about it. You can
> > extract a single character from a string, as a string, without knowing
> > anything about it except what range the first byte is in. You can use
> > this string directly as an index to a hash table containing information
> > such as unicode properties, names, etc.
> 
> I don't understand your comment. If I give you the index of the
> character,
> how do you know where its first byte is?

Er, I thought we were talking about the assertion that you can't do
anything with the character you *already have* the byte index for
without decoding it to a code point. My point is that you can determine
the number of bytes in the character without (fully) decoding it; you
only need to look at the first byte. That is especially true if all
strings are guaranteed to be valid UTF-8.

Look at first byte. First bit is 0, so it's only one byte.
Look at second byte. First three bits are 110, so it's two bytes.
Look at fourth byte. First four bits are 1110, so it's three bytes.
Look at seventh byte. First four bits are 1111, so it's four bytes.

For this, you've never looked at the last four bits of any of those
bytes, or any bits of any of the other bytes.
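
As a rough sketch of the above in Python (the function name is made up
for illustration, and it assumes the input is valid UTF-8 and that you
are pointing at the first byte of a character):

```python
def utf8_char_length(first_byte: int) -> int:
    """Return the byte length of the UTF-8 character that starts with first_byte.

    Only the first four bits are inspected. Assumes valid UTF-8.
    """
    top = first_byte >> 4          # discard the last four bits entirely
    if top < 0b1000:               # 0xxx: single-byte (ASCII) character
        return 1
    if top < 0b1100:               # 10xx: continuation byte, not a start
        raise ValueError("not the first byte of a character")
    if top < 0b1110:               # 110x: two-byte character
        return 2
    if top == 0b1110:              # 1110: three-byte character
        return 3
    return 4                       # 1111: four-byte character


# One 1-byte, one 2-byte, one 3-byte and one 4-byte character, so the
# characters start at the first, second, fourth and seventh bytes.
data = "a\u00e9\u20ac\U0001F600".encode("utf-8")
lengths = [utf8_char_length(data[i]) for i in (0, 1, 3, 6)]
print(lengths)  # [1, 2, 3, 4]
```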

For iteration, you could simply count how many bytes you encounter whose
first two bits aren't 10, until you reach the desired number. Simpler
algorithm, and works forward and backward. You only need to do what I
mentioned above to extract a character.
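
Something like this, as a sketch only (hypothetical helper names, valid
UTF-8 assumed, and no attempt at being fast):

```python
def is_char_start(b: int) -> bool:
    """True unless the first two bits of the byte are 10 (a continuation byte)."""
    return (b & 0b11000000) != 0b10000000


def offset_from_start(data: bytes, index: int) -> int:
    """Byte offset of character number `index` (0-based), scanning forward."""
    seen = 0
    for offset, b in enumerate(data):
        if is_char_start(b):
            if seen == index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


def offset_from_end(data: bytes, count: int) -> int:
    """Byte offset of the count-th character from the end (1 = last), scanning backward."""
    seen = 0
    for offset in range(len(data) - 1, -1, -1):
        if is_char_start(data[offset]):
            seen += 1
            if seen == count:
                return offset
    raise IndexError("character index out of range")


data = "a\u00e9\u20ac\U0001F600".encode("utf-8")
print(offset_from_start(data, 2))  # 3
print(offset_from_end(data, 1))    # 6
```

Neither loop looks past the first two bits of any byte.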

My point is, neither process requires you to assemble all the bits into
a complete codepoint.
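
To illustrate the hash table idea from earlier, here is a sketch where
the raw bytes of the character are the key. The table and the function
name are hypothetical, the table is tiny, and the offset is assumed to
point at the first byte of a character in valid UTF-8:

```python
# Hypothetical lookup table keyed by the UTF-8 bytes of each character.
# It is built once ahead of time; lookups never compute a code point.
NAMES = {
    b"a": "LATIN SMALL LETTER A",
    "\u00e9".encode("utf-8"): "LATIN SMALL LETTER E WITH ACUTE",
    "\u20ac".encode("utf-8"): "EURO SIGN",
}


def name_at(data: bytes, offset: int) -> str:
    """Look up the character starting at byte `offset` by its raw bytes."""
    first = data[offset]
    if first < 0x80:        # first bit 0
        length = 1
    elif first < 0xE0:      # first three bits 110
        length = 2
    elif first < 0xF0:      # first four bits 1110
        length = 3
    else:                   # first four bits 1111
        length = 4
    # The slice is used as the key directly; its bits are never assembled.
    return NAMES[data[offset:offset + length]]


data = "a\u00e9\u20ac".encode("utf-8")
print(name_at(data, 3))  # EURO SIGN
```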

> With UTF-8, character i can be
> anywhere between byte i and 4*i.