[Tutor] need some help for program!! [hidden markov models?]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Mon Nov 18 01:01:09 2002


On Mon, 18 Nov 2002, sachin mehra wrote:

> I have to take a sentence as input that has had all the spaces ( and
> punctuation) removed and put the spaces back in. I have to use a search
> based algorithm on a probabilistic language model to recover spaces.

Forgive me, but this really sounds like a homework problem.  It's simply
not right for us to help you directly on this.


There's a lot of good information on the web that you can look at to
better understand your problem.  Check:

    http://datamining.anu.edu.au/software/febrl/febrldoc/node7.html

in particular.


> It looks to me like I can use one of the following: train the model
> with an HMM and then use the Viterbi algorithm with beam search, OR
> train the model with an HMM and then use an A*/best-first algorithm.
>
> I had another question.  How can I use HMMs (Hidden Markov Models)
> to train a model given some text?

Wait, wait.  You're jumping way ahead of yourself!  The first part of your
question depends entirely on knowing why you want to use an HMM for this
particular problem.  If you don't know what the HMM is for, or how to
construct one, how can you even propose to use it to solve your problem?


Splitting a spaceless sentence should probably be your first priority, and
it's not too hard to cook something up to do sentence partitioning.  It
should only take a few lines of code to write a function that does
something like this:

###
>>> partition("attackatdawn")
[['at', 'tack', 'at', 'dawn'],
 ['attack', 'at', 'dawn']]

>>> partition("iscreamforicecream")
[['is', 'cream', 'for', 'ice', 'cream'],
 ['i', 'scream', 'for', 'ice', 'cream']]

>>> partition("uselesspython")
[['use', 'less', 'python'],
 ['useless', 'python']]

>>> partition("inaholetherelivedahobbit")
[['in', 'a', 'hole', 'the', 're', 'lived', 'a', 'hobbit'],
 ['in', 'a', 'hole', 'there', 'lived', 'a', 'hobbit']]
###

(My lexicon is just '/usr/dict/words' with some modifications, so the
partitions aren't that inventive...)  So partitioning is not (or should
not be) the real heart of your problem.
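
If it helps, here's roughly the shape such a function can take.  This
is just a minimal sketch (not my actual code!), and the tiny lexicon
dictionary below is a stand-in for a real word list:

###
## The toy LEXICON here stands in for a real word list like
## '/usr/dict/words'.
LEXICON = {'a': 1, 'at': 1, 'attack': 1, 'dawn': 1, 'tack': 1}

def partition(s, _cache={}):
    """Return every way of splitting s into words from LEXICON."""
    if s in _cache:
        return _cache[s]
    if s == '':
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in LEXICON:
            ## Every partition of the rest extends this prefix.
            for rest in partition(s[i:]):
                results.append([prefix] + rest)
    _cache[s] = results
    return results
###

Memoizing on the tail string keeps the recursion from redoing work on
longer sentences:

###
>>> partition("attackatdawn")
[['at', 'tack', 'at', 'dawn'], ['attack', 'at', 'dawn']]
###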


What's hard is figuring out, given a list of possible partitions, which
one makes the most grammatical sense!  When we see sentences like

    is cream for ice cream

or

    i scream for ice cream

how do we know which one is the "better" or "more probable" English
sentence?  Perhaps, in our brains, we're attaching semantic meaning to
each word.  Instinctively, we know that the first form "is cream for ice
cream" simply doesn't make as much sense as the second "i scream for ice
cream".  The transition from the word "is" to "cream" just doesn't "feel"
right.

It's too bad we don't know exactly how our brains do it yet.

But that's where Hidden Markov Models should come in.  A Hidden
Markov Model, properly trained on grammatically tagged English sentences,
can eat something like "is cream for ice cream" and say "the probability
of such a sentence existing in English is: 0.000001291...".

Your model might not even care about the particular words, but may only
pay attention to the grammatical role of each word.  ("cream" is a
noun, "for" is a preposition, etc.)

So you may need to break your problem into a few steps: partitioning,
stamping each possible partition with grammatical tags, and then feeding
those tagged partitions into your HMM.
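
In outline, the glue between those steps might look something like
this.  It's a hypothetical sketch that leans on the partition() and
sequence_probability() toys above, plus a stub tagger that you would
replace with one trained on real tagged text:

###
def tag_words(words):
    """Stub tagger: look each word up in a little table.  A real
    tagger would be trained on a tagged corpus."""
    tags = {'i': 'PRON', 'is': 'VERB', 'scream': 'VERB',
            'cream': 'NOUN', 'for': 'PREP', 'ice': 'NOUN'}
    return [tags.get(word, 'NOUN') for word in words]

def best_partition(s):
    """Tag and score every partition; return the most probable one."""
    scored = []
    for words in partition(s):
        scored.append((sequence_probability(tag_words(words)), words))
    scored.sort()
    return scored[-1][1]    ## the highest-scoring partition
###

The search algorithms you mentioned (Viterbi with beam search, or
A*/best-first) come into play when there are too many partitions to
score exhaustively like this.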



Again, the page at:

    http://datamining.anu.edu.au/software/febrl/febrldoc/node7.html

has an example that should help clarify why HMMs are appropriate for your
problem.


I hope that made some sort of sense.  Good luck to you.