Comparing strings from the back?

Steve Howell showell30 at yahoo.com
Fri Sep 14 20:51:07 EDT 2012


On Sep 6, 4:04 am, Oscar Benjamin <oscar.j.benja... at gmail.com> wrote:
> On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel <d... at davea.name> wrote:
> > For random strings (as defined below), the average compare time is
> > effectively unrelated to the size of the string, once the size
> > passes some point.
> > Define random string as being a selection from a set of characters,
> > with replacement.  So if we pick some set of characters, say 10 (or
> > 256, it doesn't really matter), the number of possible strings is
> > 10**N.
> > The likelihood of not finding a mismatch within k characters is
> > (1/10)**k   The likelihood of actually reaching the end of the
> > random string is (1/10)**N.  (which is the reciprocal of the number
> > of strings, naturally)
> > If we wanted an average number of comparisons, we'd have to
> > calculate a series, where each term is a probability times a value
> > for k.
> >    sum((k * 9*10**-k) for k in range(1, 10))
> > Those terms very rapidly approach 0, so it's safe to stop after a
> > few.
> > Looking at the first 9 items, I see a value of 1.1111111
> > This may not be quite right, but the value is certainly well under
> > 2 for a population of 10 characters, chosen randomly.  And notice
> > that N doesn't really come into it.
>
> It's exactly right. You can obtain this result analytically from
> Johannes' formula above. Just replace 256 with 10 to get that the
> expected number of comparisons is
>
> (10/9)*(1 - 10**(-N))
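
For reference, that series and the closed form are easy to check
numerically.  Here's a quick sketch (the function names are mine; the
10-symbol alphabet matches the quote, and I've added the boundary term
for the case where all N characters happen to match):

    def expected_comparisons_series(n, alphabet=10):
      # P(first mismatch at check k) = (1/alphabet)**(k-1) * (1 - 1/alphabet);
      # if all n characters happen to match, we still did n comparisons.
      p = 1.0 / alphabet
      total = sum(k * p ** (k - 1) * (1 - p) for k in range(1, n + 1))
      return total + n * p ** n

    def expected_comparisons_closed_form(n, alphabet=10):
      # Oscar's closed form, e.g. (10/9)*(1 - 10**(-n)) for alphabet=10.
      p = 1.0 / alphabet
      return (1 - p ** n) / (1 - p)

    for n in [5, 10, 100]:
      print(n, expected_comparisons_series(n), expected_comparisons_closed_form(n))

Both agree, and both level off at 10/9 (about 1.11) as n grows.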

The math here is simple, but of course the average running time is
also driven by usage patterns.

Let's say you have two independently produced strings of N random
bits.  The average number of bit comparisons it takes to determine
that the strings are different is between 1 and 2 checks for any
value of N, for the same reason that the average number of times you
need to flip a coin before either coming up heads or exhausting your
N tries is always less than 2, and pretty darn close to 2 for even
relatively small values of N.

    def compute_checks_for_different_strings(n):
      # Rough tally of the expected number of bit checks: the first pair
      # of bits is always compared, and each extra check only happens if
      # every pair so far matched, which is half as likely each time.
      x = 1
      p = 0.5
      for i in range(n):
        x += p * i
        p = p * 0.5
      print(n, x)

    for n in [10, 100, 1000, 10000]:
      compute_checks_for_different_strings(n)

     10 1.9892578125
     100 2.0
     1000 2.0
     10000 2.0
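
You can also just simulate it.  Here's a rough sketch (the helper name
is mine); two independent random bits mismatch with probability 1/2 at
each position, so we can draw that outcome directly:

    import random

    def average_checks(n, trials=100000):
      # Average number of bit positions inspected before a pair of
      # independent random n-bit strings is found to differ (or we run
      # out of bits to compare).
      total = 0
      for _ in range(trials):
        checks = 0
        for _ in range(n):
          checks += 1
          if random.random() < 0.5:  # this position mismatches
            break
        total += checks
      return total / float(trials)

    for n in [10, 100, 1000]:
      print(n, average_checks(n))

The averages come out right around 2 no matter what n is.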


Now let's say your string comparison function is used to verify
100-bit passwords, and 99% of the time the password is from a legit
user who types it correctly.  Except for the 1% of the time when
somebody fat-fingers the copy/paste of their password, every strcmp
call is going to have to compare all 100 pairs of characters (all N
pairs, in general).  If the usage pattern says that 99% of the time
you're simply verifying that two strings are equal, then the number
of bits is the main driving factor for running time.
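
To put a rough number on that usage pattern, here's a
back-of-the-envelope sketch (the 99/1 split is the one assumed above,
and a bad paste is treated like a random string, so it gets caught
within a couple of checks):

    def expected_checks_per_call(n_bits=100, p_equal=0.99):
      # Correct password: the comparison has to walk all n_bits positions.
      # Wrong password: treated like a random string, caught in ~2 checks.
      return p_equal * n_bits + (1 - p_equal) * 2.0

    print(expected_checks_per_call())   # about 99 checks per call

So the per-call cost is dominated by the 99% of calls that have to
walk the whole string.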
