Comparing strings from the back?

Tue Sep 4 19:37:26 EDT 2012

On 4 September 2012 22:59, Chris Angelico <rosuav at gmail.com> wrote:

> On Wed, Sep 5, 2012 at 2:32 AM, Johannes Bauer <dfnsonfsduifb at gmx.de>
> wrote:
> > How do you arrive at that conclusion? When comparing two random strings,
> > I just derived
> >
> > n = (256 / 255) * (1 - 256 ^ (-c))
> >
> > where n is the average number of character comparisons and c. The
> > rationale as follows: The first character has to be compared in any
> > case. The second with a probability of 1/256, the third with 1/(256^2)
> > and so on.
>
> That would be for comparing two random areas of memory. Python strings
> don't have 256 options per character; and in terms of actual strings,
> there's so many possibilities. The strings that a program is going to
> compare for equality are going to use a vastly restricted alphabet;
> for a lot of cases, there might be only a few dozen plausible
> characters.
>

> But even so, it's going to scale approximately linearly with the
> string length. If they're really random, then yes, there's little
> chance that either a 1MB string or a 2MB string will be the same, but
> with real data, they might very well have a long common prefix. So
> it's still going to be more or less O(n).
>

It doesn't matter whether there are 256 possible characters or 2. String
comparisons are best case O(1) and worst case O(n). For the average case we
need to assume the distribution of the strings. Assuming random strings
(with IID characters), even if there are only 2 characters the probability
that all the characters up to the jth compare equal will still decrease
exponentially with n, giving an average case of O(1) comparisons (if the
two characters are equally likely: 1/2 + 2/4 + 3/8 + 4/16 + ... + j / (2 **
j) + ...).

For real strings common prefixes may be more likely but unless the length
of those common prefixes scales with n the average case number of
comparisons required will not be O(n).

The only way to get round the problem of O(1) string comparisons is to use
strings composed entirely from a set of 1 possible character. Not only does
this do away with all of the inherent performance problems of flexible
string representations but it results in O(0) comparison complexity (far
outstripping previous Python versions).

Oscar
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20120905/71c95c5d/attachment.html>