[Python-Dev] len(chr(i)) = 2?

Thu Nov 25 04:55:40 CET 2010

Greg Ewing writes:
 > On 24/11/10 22:03, Stephen J. Turnbull wrote:
 > > But
 > > if you actually need to remember positions, or regions, to jump to
 > > later or to communicate to other code that manipulates them, doing
 > > this stuff the straightforward way (just copying the whole iterator
 > > object to hang on to its state) becomes expensive.
 > 
 > If the internal representation of a text pointer (I won't call it
 > an iterator because that means something else in Python) is a byte
 > offset or something similar, it shouldn't take up any more space
 > than a Python int, which is what you'd be using anyway if you
 > represented text positions by grapheme indexes or whatever.

That's not necessarily true.  E.g., in Emacs ("there you go again"),
Lisp integers are not only immediate (saving one pointer), but the
type is encoded in the lower bits, so that there is no need for a type
pointer -- the representation is smaller than the opaque marker type.
Altogether, up to 8 of 12 bytes saved on a 32-bit platform, or 16 of
24 bytes on a 64-bit platform.
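
To make that arithmetic concrete, here is a rough sketch in Python.  The
tag width and tag value are made up for illustration (they are not
Emacs's actual layout), and the "three words versus one" accounting is
just my reading of the figures above: a boxed object costs a reference,
a type pointer, and the value itself, while an immediate tagged integer
costs a single word.

```python
# Hypothetical tagging scheme, for illustration only; not Emacs's real layout.
TAG_BITS = 2                      # assumed number of low bits used for the type tag
INT_TAG = 0b01                    # assumed tag value meaning "immediate integer"
TAG_MASK = (1 << TAG_BITS) - 1


def box_int(value: int) -> int:
    """Pack an integer and its type tag into one machine-word-like value."""
    return (value << TAG_BITS) | INT_TAG


def is_int(word: int) -> bool:
    """Check the low bits to see whether the word is a tagged integer."""
    return (word & TAG_MASK) == INT_TAG


def unbox_int(word: int) -> int:
    """Recover the integer by shifting the tag bits away."""
    if not is_int(word):
        raise TypeError("word does not carry the integer tag")
    return word >> TAG_BITS


def bytes_saved(word_size: int) -> tuple[int, int]:
    """Return (bytes saved, boxed size) for a given word size in bytes.

    Assumes a boxed object costs three words (reference, type pointer,
    value) and an immediate tagged integer costs one word.
    """
    boxed = 3 * word_size
    immediate = word_size
    return boxed - immediate, boxed


print(bytes_saved(4))   # 32-bit platform
print(bytes_saved(8))   # 64-bit platform
```

The price of the tag is a few bits of integer range, which is why the
saving is "up to" two words rather than a free lunch.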

In Python it's true that markers can use the same data structure as
integers and simply provide different methods, and it's arguable that
Python's design is better.  But if you use bytes internally, then you
have problems.  Do you expose that byte value to the user?  Can users
(programmers using the language and end users) specify positions in
terms of byte values?  If so, what do you do if the user specifies a
byte value that points into a multibyte character?  What if the user
wants to specify position by number of characters?  Can you translate
efficiently?
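
A minimal sketch of both problems, assuming the text is stored as valid
UTF-8 and that "character" means code point (not grapheme).  The helper
names are mine, purely for illustration.  Checking whether a byte offset
lands inside a multibyte character is cheap, but translating a character
count to a byte offset needs a linear scan unless you keep some kind of
index on the side.

```python
def is_char_boundary(data: bytes, offset: int) -> bool:
    """True if offset is the start of a UTF-8 encoded code point, or the end."""
    if offset == len(data):
        return True
    if not 0 <= offset < len(data):
        return False
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def char_to_byte_offset(data: bytes, char_index: int) -> int:
    """Translate a code point index to a byte offset by scanning from the start."""
    count = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:       # first byte of a code point
            if count == char_index:
                return offset
            count += 1
    if count == char_index:             # position just past the last character
        return len(data)
    raise IndexError("character index out of range")


data = "na\u00efve caf\u00e9".encode("utf-8")   # 10 characters, 12 bytes
print(is_char_boundary(data, 3))      # byte 3 is the second byte of U+00EF
print(char_to_byte_offset(data, 3))   # the 'v' after the two-byte character
```

The scan in char_to_byte_offset is O(n) in the size of the text, which
is exactly the translation cost in question.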

As I say elsewhere, it's possible that there really never is a need to
efficiently specify an absolute position in a large text as a
character (grapheme, whatever) count.  But I think it would be hard to
implement an efficient text-processing *language*, e.g., a Python module
that handles Unicode with *full conformance*, on top of UTF-8.  Any time
you have an algorithm that requires efficient access to arbitrary text
positions, you'll spend all your skull sweat fighting the
representation.  At least, that's been my experience with Emacsen.

 > So I don't really see what you're arguing for here. How do
 > *you* think positions in unicode strings should be represented?

I think what users should see is character positions, and they should
be able to specify them numerically as well as via an opaque marker
object.  I don't care whether that position is represented as bytes or
characters internally, except that the experience of Emacsen is that
representation as byte positions is both inefficient and fragile.  The
representation as character positions is more robust but slightly less
efficient.
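
As a sketch of the user-visible side of that, here is a hypothetical
buffer whose markers are opaque objects but report character positions.
The class names are invented for illustration, "character" again means
code point, and the internal byte offset is one possible representation
rather than a recommendation.  Note that every conversion between the
two views walks the text.

```python
class Marker:
    """Opaque position in a Utf8Buffer; users see character positions."""

    def __init__(self, buffer: "Utf8Buffer", byte_offset: int) -> None:
        self._buffer = buffer
        self.byte_offset = byte_offset      # internal detail, shown for the demo

    @property
    def char_pos(self) -> int:
        # Count code points before the marker by decoding the prefix: O(n).
        return len(self._buffer.data[: self.byte_offset].decode("utf-8"))


class Utf8Buffer:
    """Text stored as UTF-8 bytes (hypothetical, illustration only)."""

    def __init__(self, text: str) -> None:
        self.data = text.encode("utf-8")

    def marker_at(self, char_pos: int) -> Marker:
        """Create a marker from a numeric character position."""
        text = self.data.decode("utf-8")    # O(n) translation on every call
        if not 0 <= char_pos <= len(text):
            raise IndexError("character position out of range")
        return Marker(self, len(text[:char_pos].encode("utf-8")))


buf = Utf8Buffer("caf\u00e9 au lait")
m = buf.marker_at(5)
print(m.char_pos, m.byte_offset)
```

Storing the character position in the marker instead would make
char_pos free and move the scan to wherever the bytes are actually
needed.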