[I18n-sig] Re: How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Tue, 26 Jun 2001 17:08:19 -0400


> Mark Davis wrote:
> > 
> > That is an interesting approach; one that basically amounts to some
> > convenience functions. For example, instead of writing:
> > 
> > myString.substring(myString.cpToIndex(3), myString.cpToIndex(5));
> > 
> > you could write:
> > 
> > myString.substring(3, 5, myString.CODEPOINT);
> > 
> > This hides some of the work, when someone is working in code points. The
> > performance cost is still there, of course; using code point indexes
> > requires each operation to examine every code unit up to that point, which
> > is much more expensive.
> 
> Good idea!
>  
> > For a general programming language or string library, I'm not sure about
> > implementing this pattern throughout. I know in the ICU library, for
> > example, we have a significant number of functions that take offsets into
> > strings. Having such a parameter on all of them would be clumsy, when most
> > of the time people are simply working in code units.
> 
> In Python this would certainly be an elegant way to add the
> code point indexing functionality (Python supports optional arguments
> with default values).
>  
> -- 
> Marc-Andre Lemburg

I still think this should be an add-on module, to emphasize we're not
eager to do a whole lot of surrogate support.

--Guido van Rossum (home page: http://www.python.org/~guido/)