FSR and unicode compliance - was Re: RE Module Performance

wxjmfauth at gmail.com wxjmfauth at gmail.com
Mon Jul 29 09:36:16 EDT 2013


Le dimanche 28 juillet 2013 19:36:00 UTC+2, Terry Reedy a écrit :
> On 7/28/2013 11:52 AM, Michael Torrie wrote:
> 
> >
> 
> > 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
> 
> > slicing a string would be very very slow,
> 
> 
> 
> Not necessarily so. See below.
> 
> 
> 
> > and that's unacceptable for
> 
> > the use cases of python strings.  I'm assuming you understand big O
> 
> > notation, as you talk of experience in many languages over the years.
> 
> > FSR and UTF-32 both are O(1) for slicing and lookups.
> 
> 
> 
> Slicing is at least O(m) where m is the length of the slice.
> 
> 
> 
> > UTF-8, 16 and any variable-width encoding are always O(n).\
> 
> 
> 
> I posted about a week ago, in response to Chris A., a method by which 
> 
> lookup for UTF-16 can be made O(log2 k), or perhaps more accurately, 
> 
> O(1+log2(k+1)), where k is the number of non-BMP chars in the string.
> 
> 
> 
> This uses an auxiliary array of k ints. An auxiliary array of n ints 
> 
> would make UFT-16 lookup O(1), but then one is using more space than 
> 
> with UFT-32. Similar comments apply to UTF-8.
> 
> 
> 
> The unicode standard says that a single strings should use exactly one 
> 
> coding scheme. It does *not* say that all strings in an application must 
> 
> use the same scheme. I just rechecked a few days ago. It also does not 
> 
> say that an application cannot associate additional data with a string 
> 
> to make processing of the string easier.
> 
> 
> 
> -- 
> 
> Terry Jan Reedy

To my knowledge, the Unicode doc always speak about
the misc. utf* coding schemes in an "exclusive or" way.

Having multiple encoded strings is one thing. Manipulating
multiple encoded strings is something else.

Maybe the mistake was to not emphasize the fact that
one has to work with a unique set of encoded code points
(utf-8 or utf-16 or utf-32) because it was considered,
as to obvious one can not work properly with multiple
coding schemes.

You are also right in saying " ...application cannot associate
additional data...".
The doc does not specify it either. It is superfleous.


jmf




More information about the Python-list mailing list