Open source English dictionary to use programmatically w/ python

Tue Jan 8 13:25:35 EST 2008

On Jan 7, 3:50 pm, "mensana... at aol.com" <mensana... at aol.com> wrote:
> On Jan 7, 5:10 pm, dgoldsmith_89 <d.l.goldsm... at gmail.com> wrote:
>
>
>
> > On Jan 7, 2:54 pm, "mensana... at aol.com" <mensana... at aol.com> wrote:
>
> > > On Jan 7, 4:37 pm, dgoldsmith_89 <d.l.goldsm... at gmail.com> wrote:
>
> > > > Can anyone point me to a downloadable open source English dictionary
> > > > suitable for programmatic use with python: I'm programming a puzzle
> > > > generator, and I need to be able to generate more or less complete
> > > > lists of English words, alphabetized.  Thanks!  DG
>
> > > www.puzzlers.org has numerous word lists & dictionaries in text
> > > format that can be downloaded. I recommend you insert them into
> > > some form of database. I have most of them in an Access db and
> > > it's 95 MB. That's a worst case as I also have some value-added
> > > stuff, the OSPD alone would be a lot smaller.
>
> > > <http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:start>
>
> > Sorry for my ignorance: I can query an Access DB w/ standard SQL
> > queries (and this is how I would access it w/ Python)?
>
> Yes, if you have the appropriate way to link to the DB.
> I use Windows and ODBC from Win32. I don't know what you
> would use on a Mac.
>
> As Paul McGuire said, you could easily do this with SQLite3.
> Personally, I always use Access since my job requires it
> and I find it much more convenient. I often use Crosstab
> tables which I think SQLite3 doesn't support. Typically,
> I'll write complex queries in Access and simple select SQL
> statements in Python to grab them.
>
> Here's my anagram locator. (the [signature] is an example
> of the value-added I mentioned).
>
> ##  I took a somewhat different approach. Instead of in a file,
> ##  I've got my word list (562456 words) in an MS-Access database.
> ##  And instead of calculating the signature on the fly, I did it
> ##  once and added the signature as a second field:
> ##
> ##  TABLE CONS_alpha_only_signature_unique
> ##  --------------------------------------
> ##  CONS       text      75
> ##  signature  text      26
> ##
> ##  The signature is a 26 character string where each character is
> ##  the count of occurrences of the matching letter. Luckily, in
> ##  only a single case was there more than 9 occurrences of any
> ##  given letter, which turned out not to be a word but a series of
> ##  words concatenated so I just deleted it from the database
> ##  (lots of crap in the original word list I used).
> ##
> ##  Example:
> ##
> ##  CONS     signature
> ##  aah      20000001000000000000000000 # 'a' occurs twice & 'h' once
> ##  aahed    20011001000000000000000000
> ##  aahing   20000011100001000000000000
> ##  aahs     20000001000000000010000000
> ##  aaii     20000000200000000000000000
> ##  aaker    20001000001000000100000000
> ##  aal      20000000000100000000000000
> ##  aalborg  21000010000100100100000000
> ##  aalesund 20011000000101000010100000
> ##
> ##  Any words with identical signatures must be anagrams.
> ##
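The signature scheme described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the Access table: the function names `signature` and `anagram_sets` are made up here, and ignoring non-letter characters and rejecting counts above 9 are assumptions based on the description.

```python
from collections import defaultdict
import string


def signature(word):
    """Return a 26-character string of per-letter counts ('a' first).

    Assumption: non-letters are ignored, and a letter that occurs more
    than 9 times is rejected because it would not fit in one digit.
    """
    counts = [0] * 26
    for ch in word.lower():
        if ch in string.ascii_lowercase:
            counts[ord(ch) - ord("a")] += 1
    if max(counts) > 9:
        raise ValueError("a letter occurs more than 9 times in %r" % word)
    return "".join(str(c) for c in counts)


def anagram_sets(words):
    """Group words by signature and keep only groups with 2+ members."""
    groups = defaultdict(list)
    for w in words:
        groups[signature(w)].append(w)
    return {sig: ws for sig, ws in groups.items() if len(ws) > 1}


if __name__ == "__main__":
    print(signature("aah"))
    print(anagram_sets(["creations", "reactions", "aah", "aal"]))
```

Storing the signature once, as in the table above, trades a little disk space for not having to recompute it on every query.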
> ##  Once this had been set up, I wrote a whole bunch of queries
> ##  to use this table. I use the normal Access drag and drop
> ##  design, but the SQL can be extracted from each, so I can
> ##  simply open the query from Python or I can grab the SQL
> ##  and build it inside the program. The example
> ##
> ##    signatures_anagrams_select_signature
> ##
> ##  is hard coded for criteria 9 & 10 and should be cast inside
> ##  Python so the criteria can be changed dynamically.
> ##
> ##
> ##  QUERY signatures_anagrams_longest
> ##  ---------------------------------
> ##  SELECT   Len([CONS]) AS Expr1,
> ##           Count(Cons_alpha_only_signature_unique.CONS) AS CountOfCONS,
> ##           Cons_alpha_only_signature_unique.signature
> ##  FROM     Cons_alpha_only_signature_unique
> ##  GROUP BY Len([CONS]),
> ##           Cons_alpha_only_signature_unique.signature
> ##  HAVING   (((Count(Cons_alpha_only_signature_unique.CONS))>1))
> ##  ORDER BY Len([CONS]) DESC ,
> ##           Count(Cons_alpha_only_signature_unique.CONS) DESC;
> ##
> ##  This is why I don't use SQLite3, must have crosstab queries.
> ##
> ##  QUERY signatures_anagram_summary
> ##  --------------------------------
> ##  TRANSFORM Count(signatures_anagrams_longest.signature) AS CountOfsignature
> ##  SELECT    signatures_anagrams_longest.Expr1 AS [length of word]
> ##  FROM      signatures_anagrams_longest
> ##  GROUP BY  signatures_anagrams_longest.Expr1
> ##  PIVOT     signatures_anagrams_longest.CountOfCONS;
> ##
> ##
> ##  QUERY signatures_anagrams_select_signature
> ##  ------------------------------------------
> ##  SELECT   Len([CONS]) AS Expr1,
> ##           Count(Cons_alpha_only_signature_unique.CONS) AS CountOfCONS,
> ##           Cons_alpha_only_signature_unique.signature
> ##  FROM     Cons_alpha_only_signature_unique
> ##  GROUP BY Len([CONS]),
> ##           Cons_alpha_only_signature_unique.signature
> ##  HAVING   (((Len([CONS]))=9) AND
> ##            ((Count(Cons_alpha_only_signature_unique.CONS))=10))
> ##  ORDER BY Len([CONS]) DESC ,
> ##           Count(Cons_alpha_only_signature_unique.CONS) DESC;
> ##
> ##  QUERY signatures_lookup_by_anagram_select_signature
> ##  ---------------------------------------------------
> ##  SELECT     signatures_anagrams_select_signature.Expr1,
> ##             signatures_anagrams_select_signature.CountOfCONS,
> ##             Cons_alpha_only_signature_unique.CONS,
> ##             Cons_alpha_only_signature_unique.signature
> ##  FROM       signatures_anagrams_select_signature
> ##  INNER JOIN Cons_alpha_only_signature_unique
> ##  ON         signatures_anagrams_select_signature.signature
> ##             = Cons_alpha_only_signature_unique.signature;
> ##
> ##
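For anyone without Access, the select and lookup queries above (everything except the crosstab) look like they should port to SQLite. Below is a rough sketch using Python's standard sqlite3 module, with the length and set-size criteria passed as parameters rather than hard coded. The table name `words`, the helper names, and the tiny word list are all invented for the example; only the `cons`/`signature` column idea comes from the post.

```python
import sqlite3


def signature(word):
    # 26 per-letter counts as one string; assumes no letter occurs 10+ times.
    counts = [0] * 26
    for ch in word.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1
    return "".join(str(c) for c in counts)


def build_db(words):
    """Load words and their precomputed signatures into an in-memory DB."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE words (cons TEXT PRIMARY KEY, signature TEXT)")
    con.executemany(
        "INSERT INTO words VALUES (?, ?)",
        [(w, signature(w)) for w in words],
    )
    con.execute("CREATE INDEX idx_signature ON words (signature)")
    return con


def anagrams_by_size(con, word_len, set_size):
    """Return words of length word_len whose anagram set has set_size members."""
    sql = """
        SELECT w.cons
        FROM words AS w
        JOIN (SELECT signature
              FROM words
              WHERE LENGTH(cons) = ?
              GROUP BY signature
              HAVING COUNT(*) = ?) AS s
          ON s.signature = w.signature
        ORDER BY w.cons
    """
    return [row[0] for row in con.execute(sql, (word_len, set_size))]


if __name__ == "__main__":
    con = build_db(["creations", "reactions", "anoretics", "aah", "aal"])
    print(anagrams_by_size(con, 9, 3))
```

The `?` placeholders let the criteria change at run time without rebuilding the SQL string by hand.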
> ##  Now it's a simple matter to use the ODBC from Win32 to extract
> ##  the query output into Python.
>
> import dbi
> import odbc
>
> con = odbc.odbc("words")
> cursor = con.cursor()
>
> ##  This first section grabs the anagram summary. Note that
> ##  queries act just like tables (as long as they don't have
> ##  internal dependencies). I read somewhere you can get the
> ##  field names, but here I put them in by hand.
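On getting the field names: the Python DB-API (PEP 249) defines a `cursor.description` attribute whose entries start with the column name. The sketch below shows it with the standard sqlite3 module; whether the Win32 odbc cursor exposes it in exactly the same way is an assumption not checked here, and the `summary` table and its columns are made up for the example.

```python
import sqlite3

# A throwaway in-memory table standing in for a real query result.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE summary (word_len INTEGER, set_size INTEGER, n INTEGER)")
con.execute("INSERT INTO summary VALUES (9, 10, 1)")

cursor = con.cursor()
cursor.execute("SELECT word_len, set_size, n FROM summary")

# Each entry of cursor.description is a sequence whose first item is the
# column name; the remaining items are type/size details (None in sqlite3).
field_names = [col[0] for col in cursor.description]
print(field_names)
```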
>
> ##cursor.execute("SELECT * FROM signatures_anagram_summary")
> ##
> ##results = cursor.fetchall()
> ##
> ##for i in results:
> ##  for j in i:
> ##    print '%4s' % (str(j)),
> ##  print
>
> ##  (if this wraps, each line is 116 characters)
> ##        2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   23
> ##   2  259 None None None None None None None None None None None None None None None None None
> ##   3  487  348  218  150  102 None None None None None None None None None None None None None
> ##   4 1343  718  398  236  142  101   51   26   25    9    8    3    2 None None None None None
> ##   5 3182 1424  777  419  274  163  106   83   53   23   20   10    6    4    5    1    3    1
> ##   6 5887 2314 1051  545  302  170  114   54   43   21   15    6    5    4    4    2 None None
> ##   7 7321 2251  886  390  151   76   49   37   14    7    5    1    1    1 None None None None
> ##   8 6993 1505  452  166   47   23    8    6    4    2    2 None None None None None None None
> ##   9 5127  830  197   47   17    6 None None    1 None None None None None None None None None
> ##  10 2975  328   66    8    2 None None None None None None None None None None None None None
> ##  11 1579  100    5    4    2 None None None None None None None None None None None None None
> ##  12  781   39    2    1 None None None None None None None None None None None None None None
> ##  13  326   11    2 None None None None None None None None None None None None None None None
> ##  14  166    2 None None None None None None None None None None None None None None None None
> ##  15   91 None    1 None None None None None None None None None None None None None None None
> ##  16   60 None None None None None None None None None None None None None None None None None
> ##  17   35 None None None None None None None None None None None None None None None None None
> ##  18   24 None None None None None None None None None None None None None None None None None
> ##  19   11 None None None None None None None None None None None None None None None None None
> ##  20    6 None None None None None None None None None None None None None None None None None
> ##  21    6 None None None None None None None None None None None None None None None None None
> ##  22    4 None None None None None None None None None None None None None None None None None
>
> ##  From the query we have the word size as row header and size of
> ##  anagram set as column header. The data value is the count of
> ##  how many different anagram sets match the row/column header.
> ##
> ##  For example, there are 7321 different 7-letter signatures whose
> ##  anagram set has 2 members. There is 1 5-letter signature having a
> ##  23 member anagram set.
> ##
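If the database has no crosstab (TRANSFORM/PIVOT) support, the same row/column summary can probably be built on the Python side from the grouped rows. The sketch below is one possible way, not what Access does internally: `crosstab` and `format_table` are names invented here, and the input is assumed to be one `(word_length, set_size)` pair per anagram set.

```python
from collections import Counter


def crosstab(pairs):
    """Pivot (word_length, set_size) pairs into a list-of-lists table.

    The first row is the header (set sizes); each later row starts with a
    word length. Cells with no matching anagram sets are None, mirroring
    the None entries in the Access output.
    """
    counts = Counter(pairs)
    lengths = sorted({length for length, _ in counts})
    sizes = sorted({size for _, size in counts})
    table = [[""] + sizes]
    for length in lengths:
        table.append([length] + [counts.get((length, size)) for size in sizes])
    return table


def format_table(table):
    """Right-align every cell in a 4-character column, one row per line."""
    return "\n".join(" ".join("%4s" % str(cell) for cell in row) for row in table)


if __name__ == "__main__":
    # Made-up sample: two 2-member sets and one 3-member set of 7-letter
    # words, plus one 23-member set of 5-letter words.
    sample = [(7, 2), (7, 2), (7, 3), (5, 23)]
    print(format_table(crosstab(sample)))
```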
> ##  We can then pick any of these, say the single 10 member anagram
> ##  set of 9-letter words, and query out the anagrams:
>
> cursor.execute("SELECT * FROM signatures_lookup_by_anagram_select_signature")
> results = cursor.fetchall()
> for i in results:
>   for j in i:
>     print j,
>   print
>
> ##  9 10 anoretics 10101000100001100111000000
> ##  9 10 atroscine 10101000100001100111000000
> ##  9 10 certosina 10101000100001100111000000
> ##  9 10 creations 10101000100001100111000000
> ##  9 10 narcotise 10101000100001100111000000
> ##  9 10 ostracine 10101000100001100111000000
> ##  9 10 reactions 10101000100001100111000000
> ##  9 10 secration 10101000100001100111000000
> ##  9 10 tinoceras 10101000100001100111000000
> ##  9 10 tricosane 10101000100001100111000000
>
> ## Nifty, eh?
>
>
>
> > DG

Yes, nifty.  Thanks for all the help, all!

DG