Anagrams
mensanator at aol.com
mensanator at aol.com
Tue Oct 23 18:10:16 EDT 2007
On Oct 23, 4:12 pm, "mensana... at aol.com" <mensana... at aol.com> wrote:
> On Oct 23, 3:21 am, cokofree... at gmail.com wrote:
>
>
>
>
>
> > This was from a random website I found on practising good programming
> > techniques and I thought I'd see what ways people could find to write
> > out this example. Below are my two examples (which should work...).
>
> > I am merely interested in other techniques people have (without
> > resorting to overusage of C modules and such like), just python and
> > good computer science techniques.
>
> > For the wordlist go to this link <a href="http://codekata.pragprog.com/
> > 2007/01/kata_six_anagra.html">Kata Anagrams</a>
> > Anyway here are my two examples.
>
> > This one uses a dictionary to store prime values of each letter in the
> > alphabet and for each line multiple the results of the characters
> > (which is unique for each anagram) and add them to a dictionary for
> > printing latter.
> > <pre>
> > def anagram_finder():
> > primeAlpha = {'a':2, 'b':3, 'c':5, 'd':7,'e' : 11, 'f':13, 'g':17,
> > 'h':19, \
> > 'i':23, 'j':29, 'k':31, 'l':37, 'm':41, 'n':43, 'o':
> > 47, 'p':53, \
> > 'q':59, 'r':61, 's':67, 't':71, 'u':73, 'v':79, 'w':
> > 83, 'x':89, \
> > 'y':97, 'z':101}
> > dictAna = {}
> > file = open ("wordlist.txt", "r")
> > for line in file:
> > value = 1
> > for s in line:
> > if s.lower() in primeAlpha:
> > value *= primeAlpha[s.lower()]
> > dictAna.setdefault(value, []).append(line.strip())
> > file.close()
> > print "\n".join(['Anagrams are: %s' % (v) for k, v in
> > dictAna.items() if len(dictAna[k]) > 1])
> > </pre>
>
> > My second is a little bit simpler. Each dictionary key is an
> > alphabetical ordering of the line in question, so that anagrams will
> > appear the same. It will add to the dictionary the new word either in
> > an existing key, or create a new one for it.
>
> > <pre>
> > def anagram_finder():
> > dict = {}
> > file = ('wordlist.txt', 'r')
> > for line in file:
> > strip_line = line.strip().lower()
> > sort_line = str(sorted(strip_line))
> > dict.setdefault(sort_line, []).append(strip_line)
> > file.close()
> > print '\n'.join(['Anagrams are: %s' % (v) for k, v in dict.items()
> > if len(dict[k]) > 1])
> > </pre>
>
> > Comparing them with timeit, they both took around 1 second or so, with
> > the first being slightly faster.
>
> > What other ways do people think will work (and do mine actually work
> > for other people!)
>
> ## I took a somewhat different approach. Instead of in a file,
> ## I've got my word list (562456 words) in an MS-Access database.
> ## And instead of calculating the signature on the fly, I did it
> ## once and added the signature as a second field:
> ##
> ## TABLE CONS_alpha_only_signature_unique
> ## --------------------------------------
> ## CONS text 75
> ## signature text 26
> ##
> ## The signature is a 26 character string where each character is
> ## the count of occurences of the matching letter. Luckily, in
> ## only a single case was there more than 9 occurences of any
> ## given letter, which turned not to be a word but a series of
> ## words concatenated so I just deleted it from the database
> ## (lots of crap in the original word list I used).
> ##
> ## Example:
> ##
> ## CONS signature
> ## aah 20000001000000000000000000 # 'a' occurs twice & 'h' once
> ## aahed 20011001000000000000000000
> ## aahing 20000011100001000000000000
> ## aahs 20000001000000000010000000
> ## aaii 20000000200000000000000000
> ## aaker 20001000001000000100000000
> ## aal 20000000000100000000000000
> ## aalborg 21000010000100100100000000
> ## aalesund
> 20011000000101000010100000
> ##
> ## Any words with identical signatures must be anagrams.
> ##
> ## Once this was been set up, I wrote a whole bunch of queries
> ## to use this table. I use the normal Access drag and drop
> ## design, but the SQL can be extracted from each, so I can
> ## simply open the query from Python or I can grab the SQL
> ## and build it inside the program. The query
> ##
> ## signatures_anagrams_select_signature
> ##
> ## is hard coded for criteria 9 & 10 and should be cast inside
> ## Python so the criteria can be changed dynamically.
> ##
> ##
> ## QUERY signatures_anagrams_longest
> ## ---------------------------------
> ## SELECT Len([CONS]) AS Expr1,
> ## Count(Cons_alpha_only_signature_unique.CONS) AS
> CountOfCONS,
> ## Cons_alpha_only_signature_unique.signature
> ## FROM Cons_alpha_only_signature_unique
> ## GROUP BY Len([CONS]),
> ## Cons_alpha_only_signature_unique.signature
> ## HAVING (((Count(Cons_alpha_only_signature_unique.CONS))>1))
> ## ORDER BY Len([CONS]) DESC ,
> ## Count(Cons_alpha_only_signature_unique.CONS) DESC;
> ##
> ## This is why I don't use SQLite3, must have crosstab queries.
> ##
> ## QUERY signatures_anagram_summary
> ## --------------------------------
> ## TRANSFORM Count(signatures_anagrams_longest.signature) AS
> CountOfsignature
> ## SELECT signatures_anagrams_longest.Expr1 AS [length of word]
> ## FROM signatures_anagrams_longest
> ## GROUP BY signatures_anagrams_longest.Expr1
> ## PIVOT signatures_anagrams_longest.CountOfCONS;
> ##
> ##
> ## QUERY signatures_anagrams_select_signature
> ## ------------------------------------------
> ## SELECT Len([CONS]) AS Expr1,
> ## Count(Cons_alpha_only_signature_unique.CONS) AS
> CountOfCONS,
> ## Cons_alpha_only_signature_unique.signature
> ## FROM Cons_alpha_only_signature_unique
> ## GROUP BY Len([CONS]),
> ## Cons_alpha_only_signature_unique.signature
> ## HAVING (((Len([CONS]))=9) AND
> ## ((Count(Cons_alpha_only_signature_unique.CONS))=10))
> ## ORDER BY Len([CONS]) DESC ,
> ## Count(Cons_alpha_only_signature_unique.CONS) DESC;
> ##
> ## QUERY signatures_lookup_by_anagram_select_signature
> ## ---------------------------------------------------
> ## SELECT signatures_anagrams_select_signature.Expr1,
> ## signatures_anagrams_select_signature.CountOfCONS,
> ## Cons_alpha_only_signature_unique.CONS,
> ## Cons_alpha_only_signature_unique.signature
> ## FROM signatures_anagrams_select_signature
> ## INNER JOIN Cons_alpha_only_signature_unique
> ## ON signatures_anagrams_select_signature.signature
> ## = Cons_alpha_only_signature_unique.signature;
> ##
> ##
> ## Now it's a simple matter to use the ODBC from Win32 to extract
> ## the query output into Python.
>
> import dbi
> import odbc
>
> con = odbc.odbc("words")
> cursor = con.cursor()
>
> ## This first section grabs the anagram summary. Note that
> ## queries act just like tables (as long as they don't have
> ## internal dependencies). I read somewhere you can get the
> ## field names, but here I put them in by hand.
>
> ##cursor.execute("SELECT * FROM signature_anagram_summary")
> ##
> ##results = cursor.fetchall()
> ##
> ##for i in results:
> ## for j in i:
> ## print '%4s' % (str(j)),
> ## print
>
> ## (if this wraps, each line is 116 characters)
Sorry, s/b 96 characters, I was looking at the line number.
Anyway, each line starts with a double hash, so it's easy
to fix.
> ## 2 3 4 5 6 7 8 9 10 11 12 13
> 14 15 16 17 18 23
> ## 2 259 None None None None None None None None None None None
> None None None None None None
> ## 3 487 348 218 150 102 None None None None None None None
> None None None None None None
> ## 4 1343 718 398 236 142 101 51 26 25 9 8 3
> 2 None None None None None
> ## 5 3182 1424 777 419 274 163 106 83 53 23 20 10
> 6 4 5 1 3 1
> ## 6 5887 2314 1051 545 302 170 114 54 43 21 15 6
> 5 4 4 2 None None
> ## 7 7321 2251 886 390 151 76 49 37 14 7 5 1
> 1 1 None None None None
> ## 8 6993 1505 452 166 47 23 8 6 4 2 2 None
> None None None None None None
> ## 9 5127 830 197 47 17 6 None None 1 None None None
> None None None None None None
> ## 10 2975 328 66 8 2 None None None None None None None
> None None None None None None
> ## 11 1579 100 5 4 2 None None None None None None None
> None None None None None None
> ## 12 781 39 2 1 None None None None None None None None
> None None None None None None
> ## 13 326 11 2 None None None None None None None None None
> None None None None None None
> ## 14 166 2 None None None None None None None None None None
> None None None None None None
> ## 15 91 None 1 None None None None None None None None None
> None None None None None None
> ## 16 60 None None None None None None None None None None None
> None None None None None None
> ## 17 35 None None None None None None None None None None None
> None None None None None None
> ## 18 24 None None None None None None None None None None None
> None None None None None None
> ## 19 11 None None None None None None None None None None None
> None None None None None None
> ## 20 6 None None None None None None None None None None None
> None None None None None None
> ## 21 6 None None None None None None None None None None None
> None None None None None None
> ## 22 4 None None None None None None None None None None None
> None None None None None None
>
> ## From the query we have the word size as row header and size of
> ## anagram set as column header. The data value is the count of
> ## how many different anagram sets match the row/column header.
> ##
> ## For example, there are 7321 different 7-letter signatures that
> ## have 2 anagram sets. There is 1 5-letter signature having a
> ## 23 member anagram set.
> ##
> ## We can then pick any of these, say the single 10 member anagram
> ## set of 9-letter words, and query out out the anagrams:
>
> cursor.execute("SELECT * FROM
> signatures_lookup_by_anagram_select_signature")
> results = cursor.fetchall()
> for i in results:
> for j in i:
> print j,
> print
>
> ## 9 10 anoretics 10101000100001100111000000
> ## 9 10 atroscine 10101000100001100111000000
> ## 9 10 certosina 10101000100001100111000000
> ## 9 10 creations 10101000100001100111000000
> ## 9 10 narcotise 10101000100001100111000000
> ## 9 10 ostracine
>
> read more »- Hide quoted text -
>
> - Show quoted text -...
More information about the Python-list
mailing list