[Python-Dev] len(chr(i)) = 2?

Fri Nov 26 08:51:35 CET 2010

On Nov 24, 2010, at 10:55 PM, Stephen J. Turnbull wrote:

> Greg Ewing writes:
>> On 24/11/10 22:03, Stephen J. Turnbull wrote:
>>> But
>>> if you actually need to remember positions, or regions, to jump to
>>> later or to communicate to other code that manipulates them, doing
>>> this stuff the straightforward way (just copying the whole iterator
>>> object to hang on to its state) becomes expensive.
>> 
>> If the internal representation of a text pointer (I won't call it
>> an iterator because that means something else in Python) is a byte
>> offset or something similar, it shouldn't take up any more space
>> than a Python int, which is what you'd be using anyway if you
>> represented text positions by grapheme indexes or whatever.
> 
> That's not necessarily true.  Eg, in Emacs ("there you go again"),
> Lisp integers are not only immediate (saving one pointer), but the
> type is encoded in the lower bits, so that there is no need for a type
> pointer -- the representation is smaller than the opaque marker type.
> Altogether, up to 8 of 12 bytes saved on a 32-bit platform, or 16 of
> 24 bytes on a 64-bit platform.

Yes, yes, Lisp is very clever.  Maybe some other runtime, like PyPy, could make this optimization.  But I don't think that anyone is filling up main memory with gigantic piles of character indexes and needs to squeeze out that extra couple of bytes of memory on such a tiny object.  Plus, this would allow such a user to stop copying the character data itself just to decode it, and on mostly-ASCII UTF-8 text (a common use-case) this is a 2x savings right off the bat.
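
To make the 2x figure concrete, here is a minimal sketch that compares the encoded sizes of a mostly-ASCII string.  The sample text is made up for illustration, and encoded byte lengths are only a stand-in for whatever a given runtime stores internally.

```python
# Hypothetical sample: mostly ASCII, with a single non-ASCII character.
text = "The quick brown fox jumps over the lazy dog. " * 100 + "café"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # fixed 2 bytes per BMP code point
utf32 = text.encode("utf-32-le")   # fixed 4 bytes per code point

# Ratio of the 2-byte form to UTF-8; close to 2 for mostly-ASCII text.
ratio_16 = len(utf16) / len(utf8)
# Ratio of the 4-byte form to UTF-8; close to 4 for mostly-ASCII text.
ratio_32 = len(utf32) / len(utf8)

print(len(utf8), len(utf16), len(utf32))
```

The ratio shrinks as the share of non-ASCII characters grows, so the saving only holds for the mostly-ASCII case.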

> In Python it's true that markers can use the same data structure as
> integers and simply provide different methods, and it's arguable that
> Python's design is better.  But if you use bytes internally, then you
> have problems.

No, you just have design questions.

> Do you expose that byte value to the user?

Yes, but only if they ask for it.  It's useful for computing things like quotas.

> Can users (programmers using the language and end users) specify positions in terms of byte values?

Sure, why not?

> If so, what do you do if the user specifies a byte value that points into a multibyte character?

Go to the beginning of the multibyte character.  Report that position; if the user then asks the requested marker object for its position, it will report that byte offset, not the originally-requested one.  (Obviously, do the same thing for surrogate pair code points.)
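
As a rough sketch of that rule, assuming the text is held as valid UTF-8 in a bytes object: continuation bytes always have the bit pattern 10xxxxxx, so backing up past them lands on the start of the character.  The function name snap_to_char_start is hypothetical, and surrogate pairs are not covered here.

```python
def snap_to_char_start(data: bytes, offset: int) -> int:
    """Return the offset of the first byte of the UTF-8 character
    that contains `offset` (assumes `data` is valid UTF-8)."""
    if not 0 <= offset <= len(data):
        raise IndexError("byte offset out of range")
    # Step back while we are sitting on a continuation byte (10xxxxxx).
    # offset == len(data) is the end position and is left alone.
    while 0 < offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


data = "naïve café".encode("utf-8")
# Byte 3 is the second byte of "ï", which starts at byte 2.
position = snap_to_char_start(data, 3)
```

A marker built this way would report `position`, not the offset that was originally requested.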

> What if the user wants to specify position by number of characters?

Part of the point that we are trying to make here is that nobody really cares about that use-case.  In order to know anything useful about a position in a text, you have to have traversed to that location in the text. You can remember interesting things like the offsets of starts of lines, or the x/y positions of characters.

> Can you translate efficiently?

No, because there's no point :).  But you _could_ implement an overlay that cached things like the beginning of lines, or the x/y positions of interesting characters.
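
For illustration, such an overlay might look like the following minimal sketch.  The class name LineIndex and its methods are hypothetical, and it assumes lines end with "\n" and that the whole text is available as bytes.

```python
import bisect


class LineIndex:
    """Hypothetical overlay caching the byte offset where each line starts."""

    def __init__(self, data: bytes):
        self.starts = [0]
        pos = data.find(b"\n")
        while pos != -1:
            # The next line begins right after the newline byte.
            self.starts.append(pos + 1)
            pos = data.find(b"\n", pos + 1)

    def line_of(self, offset: int) -> int:
        """Return the 0-based line number containing the byte offset."""
        return bisect.bisect_right(self.starts, offset) - 1

    def start_of(self, line: int) -> int:
        """Return the byte offset at which the given 0-based line starts."""
        return self.starts[line]


data = "un\ndeux é\ntrois\n".encode("utf-8")
index = LineIndex(data)
line = index.line_of(9)   # byte 9 is inside "é" on the second line
```

Building the cache is one linear pass; after that, mapping a byte offset to a line is a binary search over the cached starts.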

> As I say elsewhere, it's possible that there really never is a need to efficiently specify an absolute position in a large text as a character (grapheme, whatever) count.

> But I think it would be hard to implement an efficient text-processing *language*, eg, a Python module for *full conformance* in handling Unicode, on top of UTF-8.

Still: why?  I guess if I have some free time I'll try my hand at it, and maybe I'll run into a wall and realize you're right :).

> Any time you have an algorithm that requires efficient access to arbitrary text positions, you'll spend all your skull sweat fighting the representation.  At least, that's been my experience with Emacsen.

What sort of algorithm would that be, though?  The main thing that I could think of is a text editor trying to efficiently allow the user to scroll to the middle of a large file without reading the whole thing into memory.  But, in that case, you could use byte-positions to estimate, and display a heuristic number while calculating the real line numbers.  (This is what 'less' does, and it seems to work well.)
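
One way to produce such an estimate is sketched below.  It assumes the average line length seen in a sample from the start of the file holds for the rest of it.  The function name estimate_line_number is hypothetical, and this is not a description of how 'less' is actually implemented.

```python
def estimate_line_number(sample: bytes, target_offset: int) -> int:
    """Guess the 1-based line number at `target_offset`, using the
    average line length observed in `sample` (a prefix of the file)."""
    newlines = sample.count(b"\n")
    if newlines == 0:
        # No line breaks seen yet, so there is nothing to extrapolate from.
        return 1
    avg_line_len = len(sample) / newlines
    return int(target_offset / avg_line_len) + 1


# Made-up sample: ten 5-byte lines read from the start of a large file.
sample = b"abcd\n" * 10
guess = estimate_line_number(sample, 5000)
```

The guess can be displayed immediately and replaced once the real line numbers have been counted.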

>> So I don't really see what you're arguing for here. How do
>> *you* think positions in unicode strings should be represented?
> 
> I think what users should see is character positions, and they should
> be able to specify them numerically as well as via an opaque marker
> object.  I don't care whether that position is represented as bytes or
> characters internally, except that the experience of Emacsen is that
> representation as byte positions is both inefficient and fragile.  The
> representation as character positions is more robust but slightly more
> inefficient.

Is it really the representation as byte positions which is fragile (i.e. the internal implementation detail), or the exposure of that position to calling code, and the idiomatic usage of that number as an integer?
