permuting letters and fairy tales

Fri Nov 12 20:32:21 EST 2004

Johannes Nix <Johannes.Nix at gmx.net> wrote in message news:<67lld8kr76.fsf at aster.homelinux.net>...
> yesterday I met a cute person

Obviously this wouldn't have happened if you were using Perl.

> (after my dance class) who told me about an
> interesting experiment regarding cognition. People were told to read a
> typed text; However, in every word in the text, only the first and the
> last letter were in the right place. The other letters had their
> positions changed in an arbitrary manner. The surprising result, I was
> told, was that people can read this mixed-up text fairly well.

I never entirely bought this, although I've never gone to the
empirical extreme you did.

If you look at the oft-cited example (see
http://www.snopes.com/language/apocryph/cambridge.asp ), you will see
that the intetior letters are not even close to being in random order.
 For the most part, letters near the front of the word remain near the
front, and letters towards the back remain near the back.

My thought is that, the brain doesn't read the word as a whole as they
claim, but neither does it pay too much attention to the _exact_
ordering.  So you could scramble the letters a little, and it will
still be recognizable; but scramble them a lot, and you will have
trouble.

I suspect that a fairy tale wouldn't be too hard even with random
ordering, though, since fairy tales don't use a lot of big words.

> Because I am a somewhat sceptical guy, at times, and because I thought
> that I deserved some play, I decided to code the rule above in a
> scriptlet. The resulting 23 lines are below, and the outcome is quite
> interesting, not only from the point of view of librarians ;-) :
> 
> --------------------------------------------------
> #!/usr/bin/python
> import sys
> import locale
> import string
> import re
> import Numeric
> import RandomArray
> 
> locale.setlocale(locale.LC_CTYPE, '')
> wordsep = re.compile('([^%s])' % string.letters)
> 
> for line in sys.stdin.xreadlines():
>     for word in wordsep.split(line):
>         if word and word[0] in string.letters:
>             word = string.lower(word)
>             wlen = len(word)
>             if wlen > 3:
>                 wa = Numeric.array(word)
>                 perm = RandomArray.permutation(wlen-2)
>                 wa[1:wlen-1] = Numeric.take(wa[1:wlen-1],perm)
>                 word = wa.tostring()
>         sys.stdout.write('%s' % word)
> 
> --------------------------------------------------
>  
> 
> For the Uninitiated, Numeric is a package which deals with array data;
> arrays are mutable sequences and Numeric.take() can reorder items in
> them; RandomArray.permutation() delivers the randomized reordering we
> need.
> 
> Now I have two innocent questions:
> 
> - Is it possible to make it a bit more concise ;-))) ?

Yes, but I'd say you're already at the point where more conciseness
causes an unacceptable hit in the readability.  I suspect you also
thought this.

> - Can it coerced to run a little bit faster ?
>   (on my oldish, 300 MHz-AMD K6 , run time looks like this
>   for a famous, 2663-word-long fairy tale from the Grimm's brothers:
> 
>   nix at aster:~> time <HaenselundGretel.txt ./python/perlmutt.py  >v
>   
>   real    0m6.970s
>   user    0m3.634s
>   sys     0m0.120s

Well, the thing about Numeric and numarray is that it's not designed
to scale down well.  If performance is your concern, you are probably
better off using regular Python lists if your operations act on small
arrays, as they do in this case.  It doesn't surprise me that Steven
Bethard found that ordinary Python ran faster.

However, maybe we can use numarray to a greater advantage.  The best
way to use numarray is to operate on as much data as possible, which
in this case is the whole text file.  The question is: can we use
operations on the whole text file to do this job?  Well, we probably
can't scramble the letters of individual words by operating on the
whole text file at once.  But could we find the word boundaries that
way?  The answer is yes.  Have a look at my snippet to see how:

-----------------------
import numarray as na
from numarray import random_array as ra

def interior_scramble(text):

    # First, build a length 256 table indicating if this is a real
    # letter.  It's a boolean-valued array.

    ltable = na.array([ chr(c).isalpha() for c in range(256) ])

    # Turn the above text into a numarray array We add a space
    # before and after the text, to make it possible to see the
    # boundaries on the first and last words

    atext = na.fromstring(" %s " % text, na.UInt8)

    # Now, calculate a mask.  This uses the binary values of the
    # text string to index into the ltable, so that whereever a
    # letter appears in the text, a one appears in the mask.

    amask = ltable[atext]

    # Here's the magic: to find the word boundaries, we xor two
    # slices that are offset by one.  Learning to use slicing in
    # this way is one of the greatest ways to channel the power of
    # numarray.

    # In the mask array, a start boundary will be a zero followed
    # by a one.  An end will be a one followed by a zero.
    # Therefore, if you xor adjacent entries in the mask array,
    # the result will be nonzero whereever there's a boundary.
    # So, let's xor two one-off slices (which has the effect of
    # xoring adjacent entries) of the mask to get the word
    # boundaries.

    pat = amask[:-1] ^ amask[1:]

    # pat will be nonzero wherever there's a word boundary.  Let's
    # use numarray.nonzero to get the indices of the word
    # boundaries.  The word boundaries will be
    # start,end,start,end,... so we can reshape the array so that
    # each row is a start,end pair.

    words = na.reshape(na.nonzero(pat)[0],(-1,2))

    # Because of how the resulting array aligns, we have to add 2
    # to each start value to get the index where the scrambling
    # begins.

    words[:] += [2,0]

    # We got the word boundaries.  Now, loop through them and
    # scramble.  Because the whole file is a numarray, we don't
    # need to convert each word to a numarray and back
    # individually.  That also saves some time.

    for start,end in words:
        n = end-start
        if n < 2:
            continue
        perm = ra.permutation(end-start)
        atext[start:end] = atext[start:end][perm]

    # It's scrambled.  Convert the applicable slice back to a
    # string.

    return atext[1:-1].tostring()
-----------------

I'm not sure if this is faster than your method or the pure-Python
method; I didn't check (I worked too hard writing it :).  But,
generally speaking, whenever you can operate on your whole data set
using numarray at once, doing so will be a big time savings.

-- 
CARL BANKS