Anyone know anything named DX?

Alex Martelli aleaxit at yahoo.com
Tue Sep 7 05:43:35 EDT 2004


Leif K-Brooks <eurleif at ecritters.biz> wrote:

> Alex Martelli wrote:
> > There's a recipe for the first part of this (generating
> > non-totally-random passwords by pastiche, i.e. Markov Chain) in the 1st
> > printed edition of the Cookbook -- it would be neat to add a back-end
> > for the second part, the check with the Google API...
> 
> I've played around with Markov Chains and the Google API before, and it 
> wouldn't be very hard to implement (if you don't care about speed or 
> sanity, anyway). I think the toughest part would be gathering word lists
> for the subject matter Roger Binns mentioned.
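
[For readers who have not seen the pastiche idea quoted above: the
following is only a minimal sketch of it, not the Cookbook recipe itself.
It assumes modern Python 3, a letter-level chain of order 2, and the
hypothetical names build_model and make_word. The Google API check is
left out entirely.]

```python
import random
from collections import defaultdict

def build_model(text, order=2):
    """Map each `order`-letter context to the letters seen following it."""
    model = defaultdict(list)
    # Keep only letters and spaces so the markers below cannot clash with input.
    letters = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    for word in letters.split():
        padded = "^" * order + word + "$"   # ^ marks word start, $ marks word end
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def make_word(model, order=2, max_len=12, rng=random):
    """Walk the chain from the start context until an end marker or max_len."""
    context, out = "^" * order, []
    while len(out) < max_len:
        nxt = rng.choice(model[context])    # raises IndexError on an empty model
        if nxt == "$":
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out)
```

[Usage would be along the lines of make_word(build_model(corpus_text)).
For anything resembling a real password, passing rng=random.SystemRandom()
would be the safer choice, since the default generator is not meant for
security purposes.]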

Heh, gathering (and cleaning up, etc) good clean corpora was indeed the
hardest part of building a Markov model for natural language (for speech
recognition purposes) as we were doing in IBM Research starting about 20
years ago -- that's when I learned to love scripting, AKA very high
level, languages (at that time and place, that meant Rexx).

But today, with so much material on any given field in any given
language available from the web, the task is _way_ easier -- for the
generic Italian corpus of the '80s we had to "reverse engineer" the text
from tens of millions of words that were available in machine-readable
form only as binary files ready to drive some kind of photocomposer,
kindly supplied to us by various newspapers, agencies and
publishers...!-)

Oops, I'm slipping into war stories, like us old codgers tend to do, I'd
better stop right here!  Still, the advice is to wget or urllib.urlopen a
bunch of web pages of interest and format them into reasonably clean
text -- shouldn't be all THAT tough!
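
[A rough sketch of that fetch-and-clean step, assuming modern Python 3
and only the standard library. The names TextExtractor, html_to_text and
fetch_text are made up for illustration, and the cleanup is deliberately
crude: it drops script and style contents, keeps every other text node,
and collapses whitespace.]

```python
import re
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()                      # flush any buffered trailing text
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

def fetch_text(url):
    """Download one page and reduce it to plain text (needs network access)."""
    with urllib.request.urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return html_to_text(resp.read().decode(charset, errors="replace"))
```

[Joining text nodes with a space means a word split by inline markup,
such as "<b>wor</b>ld", comes out as two tokens; a real corpus would
likely want more careful handling, plus removal of navigation menus and
other page furniture.]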


Alex


