Anagrams

Tue Oct 23 17:12:30 EDT 2007

On Oct 23, 3:21 am, cokofree... at gmail.com wrote:
> This was from a random website I found on practising good programming
> techniques and I thought I'd see what ways people could find to write
> out this example. Below are my two examples (which should work...).
>
> I am merely interested in other techniques people have (without
> resorting to overusage of C modules and such like), just python and
> good computer science techniques.
>
> For the wordlist go to this link <a href="http://codekata.pragprog.com/
> 2007/01/kata_six_anagra.html">Kata Anagrams</a>
> Anyway here are my two examples.
>
> This one uses a dictionary to store prime values of each letter in the
> alphabet and for each line multiple the results of the characters
> (which is unique for each anagram) and add them to a dictionary for
> printing latter.
> <pre>
> def anagram_finder():
>     primeAlpha = {'a':2, 'b':3, 'c':5, 'd':7,'e' : 11, 'f':13, 'g':17,
> 'h':19,        \
>                   'i':23, 'j':29, 'k':31, 'l':37, 'm':41, 'n':43, 'o':
> 47, 'p':53,    \
>                   'q':59, 'r':61, 's':67, 't':71, 'u':73, 'v':79, 'w':
> 83, 'x':89,    \
>                   'y':97, 'z':101}
>     dictAna =  {}
>     file = open ("wordlist.txt", "r")
>     for line in file:
>         value = 1
>         for s in line:
>             if s.lower() in primeAlpha:
>                    value *= primeAlpha[s.lower()]
>         dictAna.setdefault(value, []).append(line.strip())
>     file.close()
>     print "\n".join(['Anagrams are: %s' % (v) for k, v in
> dictAna.items() if len(dictAna[k]) > 1])
> </pre>
>
> My second is a little bit simpler. Each dictionary key is an
> alphabetical ordering of the line in question, so that anagrams will
> appear the same. It will add to the dictionary the new word either in
> an existing key, or create a new one for it.
>
> <pre>
> def anagram_finder():
>     dict = {}
>     file = ('wordlist.txt', 'r')
>     for line in file:
>                 strip_line = line.strip().lower()
>                 sort_line = str(sorted(strip_line))
>                 dict.setdefault(sort_line, []).append(strip_line)
>     file.close()
>     print '\n'.join(['Anagrams are: %s' % (v) for k, v in dict.items()
> if len(dict[k]) > 1])
> </pre>
>
> Comparing them with timeit, they both took around 1 second or so, with
> the first being slightly faster.
>
> What other ways do people think will work (and do mine actually work
> for other people!)

##  I took a somewhat different approach. Instead of in a file,
##  I've got my word list (562456 words) in an MS-Access database.
##  And instead of calculating the signature on the fly, I did it
##  once and added the signature as a second field:
##
##  TABLE CONS_alpha_only_signature_unique
##  --------------------------------------
##  CONS       text      75
##  signature  text      26
##
##  The signature is a 26 character string where each character is
##  the count of occurences of the matching letter. Luckily, in
##  only a single case was there more than 9 occurences of any
##  given letter, which turned not to be a word but a series of
##  words concatenated so I just deleted it from the database
##  (lots of crap in the original word list I used).
##
##  Example:
##
##  CONS     signature
##  aah      20000001000000000000000000 # 'a' occurs twice & 'h' once
##  aahed    20011001000000000000000000
##  aahing   20000011100001000000000000
##  aahs     20000001000000000010000000
##  aaii     20000000200000000000000000
##  aaker    20001000001000000100000000
##  aal      20000000000100000000000000
##  aalborg  21000010000100100100000000
##  aalesund
20011000000101000010100000
##
##  Any words with identical signatures must be anagrams.
##
##  Once this was been set up, I wrote a whole bunch of queries
##  to use this table. I use the normal Access drag and drop
##  design, but the SQL can be extracted from each, so I can
##  simply open the query from Python or I can grab the SQL
##  and build it inside the program. The query
##
##    signatures_anagrams_select_signature
##
##  is hard coded for criteria 9 & 10 and should be cast inside
##  Python so the criteria can be changed dynamically.
##
##
##  QUERY signatures_anagrams_longest
##  ---------------------------------
##  SELECT   Len([CONS]) AS Expr1,
##           Count(Cons_alpha_only_signature_unique.CONS) AS
CountOfCONS,
##           Cons_alpha_only_signature_unique.signature
##  FROM     Cons_alpha_only_signature_unique
##  GROUP BY Len([CONS]),
##           Cons_alpha_only_signature_unique.signature
##  HAVING   (((Count(Cons_alpha_only_signature_unique.CONS))>1))
##  ORDER BY Len([CONS]) DESC ,
##           Count(Cons_alpha_only_signature_unique.CONS) DESC;
##
##  This is why I don't use SQLite3, must have crosstab queries.
##
##  QUERY signatures_anagram_summary
##  --------------------------------
##  TRANSFORM Count(signatures_anagrams_longest.signature) AS
CountOfsignature
##  SELECT    signatures_anagrams_longest.Expr1 AS [length of word]
##  FROM      signatures_anagrams_longest
##  GROUP BY  signatures_anagrams_longest.Expr1
##  PIVOT     signatures_anagrams_longest.CountOfCONS;
##
##
##  QUERY signatures_anagrams_select_signature
##  ------------------------------------------
##  SELECT   Len([CONS]) AS Expr1,
##           Count(Cons_alpha_only_signature_unique.CONS) AS
CountOfCONS,
##           Cons_alpha_only_signature_unique.signature
##  FROM     Cons_alpha_only_signature_unique
##  GROUP BY Len([CONS]),
##           Cons_alpha_only_signature_unique.signature
##  HAVING   (((Len([CONS]))=9) AND
##            ((Count(Cons_alpha_only_signature_unique.CONS))=10))
##  ORDER BY Len([CONS]) DESC ,
##           Count(Cons_alpha_only_signature_unique.CONS) DESC;
##
##  QUERY signatures_lookup_by_anagram_select_signature
##  ---------------------------------------------------
##  SELECT     signatures_anagrams_select_signature.Expr1,
##             signatures_anagrams_select_signature.CountOfCONS,
##             Cons_alpha_only_signature_unique.CONS,
##             Cons_alpha_only_signature_unique.signature
##  FROM       signatures_anagrams_select_signature
##  INNER JOIN Cons_alpha_only_signature_unique
##  ON         signatures_anagrams_select_signature.signature
##             = Cons_alpha_only_signature_unique.signature;
##
##
##  Now it's a simple matter to use the ODBC from Win32 to extract
##  the query output into Python.

import dbi
import odbc

con = odbc.odbc("words")
cursor = con.cursor()

##  This first section grabs the anagram summary. Note that
##  queries act just like tables (as long as they don't have
##  internal dependencies). I read somewhere you can get the
##  field names, but here I put them in by hand.

##cursor.execute("SELECT * FROM signature_anagram_summary")
##
##results = cursor.fetchall()
##
##for i in results:
##  for j in i:
##    print '%4s' % (str(j)),
##  print

##  (if this wraps, each line is 116 characters)
##        2    3    4    5    6    7    8    9   10   11   12   13
14   15   16   17   18   23
##   2  259 None None None None None None None None None None None
None None None None None None
##   3  487  348  218  150  102 None None None None None None None
None None None None None None
##   4 1343  718  398  236  142  101   51   26   25    9    8    3
2 None None None None None
##   5 3182 1424  777  419  274  163  106   83   53   23   20   10
6    4    5    1    3    1
##   6 5887 2314 1051  545  302  170  114   54   43   21   15    6
5    4    4    2 None None
##   7 7321 2251  886  390  151   76   49   37   14    7    5    1
1    1 None None None None
##   8 6993 1505  452  166   47   23    8    6    4    2    2 None
None None None None None None
##   9 5127  830  197   47   17    6 None None    1 None None None
None None None None None None
##  10 2975  328   66    8    2 None None None None None None None
None None None None None None
##  11 1579  100    5    4    2 None None None None None None None
None None None None None None
##  12  781   39    2    1 None None None None None None None None
None None None None None None
##  13  326   11    2 None None None None None None None None None
None None None None None None
##  14  166    2 None None None None None None None None None None
None None None None None None
##  15   91 None    1 None None None None None None None None None
None None None None None None
##  16   60 None None None None None None None None None None None
None None None None None None
##  17   35 None None None None None None None None None None None
None None None None None None
##  18   24 None None None None None None None None None None None
None None None None None None
##  19   11 None None None None None None None None None None None
None None None None None None
##  20    6 None None None None None None None None None None None
None None None None None None
##  21    6 None None None None None None None None None None None
None None None None None None
##  22    4 None None None None None None None None None None None
None None None None None None

##  From the query we have the word size as row header and size of
##  anagram set as column header. The data value is the count of
##  how many different anagram sets match the row/column header.
##
##  For example, there are 7321 different 7-letter signatures that
##  have 2 anagram sets. There is 1 5-letter signature having a
##  23 member anagram set.
##
##  We can then pick any of these, say the single 10 member anagram
##  set of 9-letter words, and query out out the anagrams:

cursor.execute("SELECT * FROM
signatures_lookup_by_anagram_select_signature")
results = cursor.fetchall()
for i in results:
  for j in i:
    print j,
  print

##  9 10 anoretics 10101000100001100111000000
##  9 10 atroscine 10101000100001100111000000
##  9 10 certosina 10101000100001100111000000
##  9 10 creations 10101000100001100111000000
##  9 10 narcotise 10101000100001100111000000
##  9 10 ostracine 10101000100001100111000000
##  9 10 reactions 10101000100001100111000000
##  9 10 secration 10101000100001100111000000
##  9 10 tinoceras 10101000100001100111000000
##  9 10 tricosane 10101000100001100111000000

## Nifty, eh?