updating dictionaries from/to dictionaries

Brandon your.master at gmail.com
Tue Aug 12 15:33:48 EDT 2008


On Aug 12, 7:26 am, John Machin <sjmac... at lexicon.net> wrote:
> On Aug 12, 12:26 pm, Brandon <your.mas... at gmail.com> wrote:
>
>
>
> > You are very correct about the Laplace adjustment.  However, a more
> > precise statement of my overall problem would involve training and
> > testing which utilizes bigram probabilities derived in part from the
> > Laplace adjustment; as I understand the workflow that I should follow,
> > I can't allow myself to be constrained only to bigrams that actually
> > exist in training or my overall probability when I run through testing
> > will be thrown off to 0 as soon as a test bigram that doesn't exist in
> > training is encountered.  Hence my desire to find all possible bigrams
> > in train (having taken steps to ensure proper set relations between
> > train and test).
> >  The best way I can currently see to do this is with
> > my current two-dictionary "caper", and by iterating over foo, not
> > bar :)
>
> I can't grok large chunks of the above, especially these troublesome
> test bigrams that don't exist in training but which you desire to find
> in train(ing?).
>
> However let's look at the mechanics: Are you now saying that your
> original assertion "I am certain that all keys in bar belong to foo as
> well" was not quite "precise"? If not, please explain why you think
> you need to iterate (slowly) over foo in order to accomplish your
> stated task.

I was merely trying to be brief.  The statement of my certainty about
foo/bar was precise as a stand-alone statement, but I was attempting
to say that within the context of the larger problem, I need to
iterate over foo.

This is actually for a school project, but since I have already worked
out a feasible (if perhaps not entirely optimized) workflow, I don't
feel overly guilty about sharing it or getting some small amount of
input - but certainly none is asked for beyond what you've given
me :)

I am tasked with finding the joint probability of a test sequence,
using bigram probabilities derived from train(ing) counts.
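
Concretely, the Laplace (add-one) bigram probability I have in mind is
(count(v, w) + 1) / (count(v) + V), where V is the vocabulary size, so
the joint probability of test comes out roughly like the sketch below.
The names smoothed_bigrams and unigram_counts are just stand-ins for my
real dictionaries:

import math

def sequence_log_prob(test_tokens, smoothed_bigrams, unigram_counts,
                      vocab_size):
    # smoothed_bigrams maps (v, w) -> count(v, w) + 1 from train
    # unigram_counts maps v -> count(v) in train
    log_prob = 0.0
    for v, w in zip(test_tokens, test_tokens[1:]):
        # P(w | v) = (count(v, w) + 1) / (count(v) + V)
        p = smoothed_bigrams[(v, w)] / float(unigram_counts[v] + vocab_size)
        log_prob += math.log(p)
    return log_prob

(I sum logs rather than multiplying raw probabilities so a long test
sequence doesn't underflow to 0 for numerical rather than statistical
reasons.)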

I have ensured that all members (unigrams) of test are also members of
train, although I have no idea about the bigram frequencies in test.
Thus I need frequencies for every possible bigram over train's
vocabulary in order to be prepared for any test bigram I might
encounter.

The problem is that without Laplace smoothing, many POTENTIAL bigrams
in train might have an ACTUAL frequency of 0 in train.  And if one or
more of those zero-frequency bigrams is actually found in test, the
joint probability of test becomes 0, and that's no fun at all.  So I
made a dictionary foo whose keys are all POTENTIAL training bigrams,
each with a smoothed frequency of 1.  I also made a dictionary bar
whose keys are all ACTUAL training bigrams, with their actual counts
as values.  I needed to combine the two dictionaries as a first step
toward finding the test sequence probability.  That way any bigram in
test has at least a smoothed train frequency of 1, and possibly a
smoothed frequency of its existing train count + 1.  After iterating
over foo, foo becomes the dictionary holding these smoothed, combined
train frequencies.  I don't see a way to combine the two types of
counts into one dictionary without keeping them separate first.
Hence the caper.
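
In rough Python, assuming train is just a list of tokens, the caper
looks something like this:

vocab = set(train)

# foo: every POTENTIAL bigram over the training vocabulary, seeded
# with a smoothed count of 1.
foo = {}
for v in vocab:
    for w in vocab:
        foo[(v, w)] = 1

# bar: every ACTUAL adjacent pair observed in train, with its real count.
bar = {}
for pair in zip(train, train[1:]):
    bar[pair] = bar.get(pair, 0) + 1

# Combine by iterating over foo, not bar: each bigram ends up at
# actual count + 1 if it occurred, otherwise the floor of 1.
for key in foo:
    foo[key] += bar.get(key, 0)

Iterating over foo is the slow part, since foo has len(vocab) ** 2
entries, but it guarantees that a key exists for any test bigram I
might encounter.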

Sorry for the small essay.

P.S. I do realize that there are better smoothing methods than
Laplace, but that is what the problem specifies.
