[Python-Dev] len(chr(i)) = 2?

Glyph Lefkowitz glyph at twistedmatrix.com
Wed Nov 24 04:27:38 CET 2010


On Nov 23, 2010, at 9:44 PM, Stephen J. Turnbull wrote:

> James Y Knight writes:
> 
>> You put a smiley, but, in all seriousness, I think that's actually
>> the right thing to do if anyone writes a new programming
>> language. It is clearly the right thing if you don't have to be
>> concerned with backwards-compatibility: nobody really needs to be
>> able to access the Nth codepoint in a string in constant time, so
>> there's not really any point in storing a vector of codepoints.
> 
> A sad commentary on the state of Emacs usage, "nobody".
> 
> The theory is that accessing the first character of a region in a
> string often occurs as a primitive operation in O(N) or worse
> algorithms, sometimes without enough locality at the "collection of
> regions" level to give a reasonably small average access time.

I'm not sure what you mean by "the theory is".  Whose theory?  About what?

> In practice, any *Emacs user can tell you that yes, we do need to be
> able to access the Nth codepoint in a buffer in constant time.  The
> O(N) behavior of current Emacs implementations means that people often
> use a binary coding system on large files.  Yes, some position caching
> is done, but if you have a large file (eg, a mail file) which is
> virtually segmented using pointers to regions, locality gets lost.
> (This is not a design bug, this is a fundamental requirement: consider
> fast switching between threaded view and author-sorted view.)

Sounds like a design bug to me.  Personally, I'd implement "fast switching between threaded view and author-sorted view" the same way I'd address any other multiple-views-on-the-same-data problem.  I'd retain data structures for both, and update them as the underlying model changed.
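
For instance (a minimal sketch of the bookkeeping, not a claim about how Emacs or any mail reader actually does it; Message and MessageStore are names I'm inventing here):

    import bisect
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Message:                  # stand-in for a real mail message
        author: str
        thread_id: str
        subject: str

    class MessageStore:
        def __init__(self):
            self.threaded = defaultdict(list)   # thread_id -> messages
            self._by_author = []                # kept sorted on insert

        def add(self, msg):
            # Update *both* views as the model changes; switching views
            # later is free, because neither one is derived on demand.
            self.threaded[msg.thread_id].append(msg)
            bisect.insort(self._by_author, (msg.author, id(msg), msg))

        def author_sorted(self):
            return [msg for _, _, msg in self._by_author]

(insort on a list is O(n) per insert; a real implementation would want a better ordered container, but the point here is the incremental bookkeeping, not the container.)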

These representations may need to maintain cursors into the underlying character data, if they must retain giant wads of character data as an underlying representation (arguably the _main_ design bug in Emacs: it encourages you to do that for everything, rather than imposing a sensible structure).  But those cursors don't need to be code-point counters; they could be byte offsets, or opaque handles whose precise meaning varies with the potentially variable underlying storage.
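
Concretely, the cheapest such handle is just a byte offset into UTF-8 data.  A sketch, assuming valid UTF-8 and cursors that always sit on codepoint boundaries (Buffer and Cursor are invented names, not an existing API):

    class Cursor:
        """Opaque handle: a byte offset the caller never interprets."""
        __slots__ = ("_offset",)
        def __init__(self, offset=0):
            self._offset = offset

    class Buffer:
        def __init__(self, data: bytes):
            self.data = data                # UTF-8 encoded text

        def codepoint_at(self, cur: Cursor) -> str:
            # The lead byte alone determines the sequence length,
            # so decoding at a remembered position is O(1).
            b = self.data[cur._offset]
            n = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            return self.data[cur._offset:cur._offset + n].decode("utf-8")

        def advance(self, cur: Cursor) -> Cursor:
            # Return a *new* cursor one codepoint further on.
            b = self.data[cur._offset]
            n = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            return Cursor(cur._offset + n)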

Also, please remember that Emacs couldn't be implemented with giant Python strings anyway: crucially, all of this stuff is _mutable_ in Emacs.
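
The contrast is trivial to demonstrate:

    s = "hello"
    try:
        s[0] = "H"              # str is immutable; this raises TypeError
    except TypeError as err:
        print(err)              # 'str' object does not support item assignment
    s = "H" + s[1:]             # the only "edit" available copies the whole string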

> And of course an operation that sorts regions in a buffer using
> character pointers will have the same problem.  Working with memory
> pointers, OTOH, sucks more than that; GNU Emacs recently bit the
> bullet and got rid of their higher-level memory-oriented APIs, all of
> the Lisp structures now work with pointers, and only the very
> low-level structures know about character-to-memory pointer
> translation.
> 
> This performance issue is perceptible even on 3GHz machines with not
> so large (50MB) mbox files.  It's *horrid* if you do something like
> "occur" on a 1GB log file, then try randomly jumping to detected log
> entries.

Case in point: "occur" needs to scan the buffer anyway; you can't do better than linear time there.  So you're going to iterate through the buffer, using one of the techniques that James proposed, and remember some locations.  Why not just have those locations be opaque cursors into your data?
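
Sketched out (the log file and pattern are made up, of course):

    import re

    def occur(data: bytes, pattern: bytes):
        # The one unavoidable O(N) pass; remember opaque byte offsets
        # (our "cursors") as we go.
        return [m.start() for m in re.finditer(pattern, data)]

    with open("big.log", "rb") as f:
        log = f.read()
    hits = occur(log, rb"ERROR")
    if hits:
        # Jumping back to any hit is O(1); no codepoint counting needed.
        print(log[hits[-1]:hits[-1] + 80])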

In summary: you're right, in that James missed a spot.  You need bidirectional, *copyable* iterators that can traverse the string by byte, codepoint, grapheme, or decomposed glyph.
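
Roughly this shape, say (a sketch of the API I have in mind, not a patch; the grapheme case is hand-waved, since doing it properly means implementing UAX #29 segmentation):

    import copy

    class StringCursor:
        """Bidirectional, copyable cursor over UTF-8 bytes."""
        def __init__(self, data: bytes, offset: int = 0):
            self.data, self.offset = data, offset

        def clone(self):
            return copy.copy(self)          # copyable: cheap to remember

        def next_byte(self):
            self.offset += 1

        def prev_byte(self):
            self.offset -= 1

        def next_codepoint(self):
            # Step over the lead byte, then any 10xxxxxx continuations.
            self.offset += 1
            while (self.offset < len(self.data)
                   and self.data[self.offset] & 0xC0 == 0x80):
                self.offset += 1

        def prev_codepoint(self):
            self.offset -= 1
            while self.offset > 0 and self.data[self.offset] & 0xC0 == 0x80:
                self.offset -= 1

        def next_grapheme(self):
            raise NotImplementedError("requires UAX #29 segmentation")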


