updating dictionaries from/to dictionaries

Tue Aug 12 17:29:00 EDT 2008

On Aug 13, 5:33 am, Brandon <your.mas... at gmail.com> wrote:
> On Aug 12, 7:26 am, John Machin <sjmac... at lexicon.net> wrote:
>
> > On Aug 12, 12:26 pm, Brandon <your.mas... at gmail.com> wrote:
>
> > > You are very correct about the Laplace adjustment.  However, a more
> > > precise statement of my overall problem would involve training and
> > > testing which utilizes bigram probabilities derived in part from the
> > > Laplace adjustment; as I understand the workflow that I should follow,
> > > I can't allow myself to be constrained only to bigrams that actually
> > > exist in training or my overall probability when I run through testing
> > > will be thrown off to 0 as soon as a test bigram that doesn't exist in
> > > training is encountered.  Hence my desire to find all possible bigrams
> > > in train (having taken steps to ensure proper set relations between
> > > train and test).  The best way I can currently see to do this is with
> > > my current two-dictionary "caper", and by iterating over foo, not
> > > bar :)
>
> > I can't grok large chunks of the above, especially these troublesome
> > test bigrams that don't exist in training but which you desire to find
> > in train(ing?).
>
> > However let's look at the mechanics: Are you now saying that your
> > original assertion "I am certain that all keys in bar belong to foo as
> > well" was not quite "precise"? If not, please explain why you think
> > you need to iterate (slowly) over foo in order to accomplish your
> > stated task.
>
> I was merely trying to be brief.  The statement of my certainty about
> foo/bar was precise as a stand-alone statement, but I was attempting
> to say that within the context of the larger problem, I need to
> iterate over foo.
>
> This is actually for a school project, but as I have already worked
> out a feasible (if perhaps not entirely optimized) workflow, I don't
> feel overly guilty about sharing this or getting some small amount of
> input - but certainly none is asked for beyond what you've given
> me :)  I am tasked with finding the joint probability of a test
> sequence, utilizing bigram probabilities derived from train(ing)
> counts.
>
> I have ensured that all members (unigrams) of test are also members of
> train, although I do not have any idea as to bigram frequencies in
> test.  Thus I need to iterate over all members of train for training
> bigram frequencies in order to be prepared for any test bigram I might
> encounter.
>
> The problem is that without Laplace smoothing, many POTENTIAL bigrams
> in train might have an ACTUAL frequency of 0 in train.  And if one or
> more of those bigrams which have 0 frequency in train is actually
> found in test, the joint probability of test will become 0, and that's
> no fun at all.  So I made foo dictionary that creates all POTENTIAL
> training bigrams with a smoothed frequency of 1.  I also made bar
> dictionary that creates keys of all ACTUAL training bigrams with their
> actual values.  I needed to combine the two dictionaries as a first
> step to eventually finding the test sequence probability.

Let's assume this need is real for the moment. Put this loop in your
code after the creation of foo and bar and before you "combine" them:

for key in bar:
   assert key in foo

Does it raise an AssertionError? If so, either:
   you have a bug in the creation of foo or bar (or both!),
or:
   the certainty you had in making your opening statement "I am
   certain that all keys in bar belong to foo as well" was not
   well-founded.
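
If the assert does fire, it helps to see which keys are the offenders.
A minimal sketch, assuming foo and bar are plain dicts; the helper name
keys_missing_from and the toy data are made up for illustration:

```python
def keys_missing_from(foo, bar):
    """Return the set of keys that are in bar but not in foo."""
    return set(bar) - set(foo)


# Hypothetical toy data: ("b", "a") was counted in bar but never created in foo.
foo = {("a", "a"): 1, ("a", "b"): 1}
bar = {("a", "b"): 3, ("b", "a"): 2}

missing = keys_missing_from(foo, bar)
print(missing)
```

An empty result means every key in bar is also in foo.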

If however it is correct that all keys in bar are also to be found in
foo, then the following snippets of code are equivalent for your
purpose of adding bar frequencies into foo:

(1) iterating over foo:
for key in foo:
   foo[key] += bar.get(key, 0)

(2) iterating over bar:
for key in bar:
   foo[key] += bar[key]
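
The equivalence of the two loops can be checked on a toy example. This
is only a sketch: the three-word vocabulary and the counts are invented,
and the function names merge_over_foo and merge_over_bar are mine, not
anything from your code.

```python
def merge_over_foo(foo, bar):
    """Variant (1): visit every key of foo, adding bar's count or 0."""
    merged = dict(foo)  # copy, so the caller's foo is left untouched
    for key in merged:
        merged[key] += bar.get(key, 0)
    return merged


def merge_over_bar(foo, bar):
    """Variant (2): visit only the keys of bar."""
    merged = dict(foo)
    for key in bar:
        merged[key] += bar[key]
    return merged


# Hypothetical toy data: every potential bigram starts at 1 in foo,
# and bar holds only the bigrams actually seen in training.
words = ["a", "b", "c"]
foo = {(w1, w2): 1 for w1 in words for w2 in words}
bar = {("a", "b"): 3, ("b", "c"): 2}

# Precondition from the discussion: all keys in bar are also in foo.
assert all(key in foo for key in bar)

via_foo = merge_over_foo(foo, bar)
via_bar = merge_over_bar(foo, bar)
print(via_foo == via_bar)
```

Variant (2) does len(bar) updates, while variant (1) does len(foo)
updates, most of which add 0.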

I (again) challenge you to say *why* you feel that the "iterating over
bar" solution will not work.

> So any
> bigram in test will at least have a smoothed train frequency of 1 and
> possibly a smoothed train frequency of the existing train value + 1.
> Having iterated over foo, foo becomes the dictionary which holds these
> smoothed & combined train frequencies.  I don't see a way to combine
> the two types of counts into one dictionary without keeping them
> separate first. Hence the caper.
>

Let's start with "So I made foo dictionary that creates all POTENTIAL
training bigrams with a smoothed frequency of 1". Let me guess that
you have a set W of all words ever used/usable in the language of the
texts that you are considering ... let N = len(W). So the number of
potential bigrams is N**2. Hmmm, how large is N, and have you actually
run the foo-building code yet?
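
To get a feel for how fast N**2 grows, here is a rough sketch. The
vocabulary sizes are purely illustrative; I don't know your actual N.

```python
def potential_bigrams(vocab_size):
    """Number of ordered word pairs over a vocabulary of vocab_size words."""
    return vocab_size ** 2


# Illustrative vocabulary sizes only, not taken from any real corpus.
for n in (1_000, 10_000, 100_000):
    print(n, potential_bigrams(n))
```

Each of those pairs would be one tuple key plus one integer value in
foo, so memory use grows with the square of the vocabulary size.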

Now, assuming foo does fit in memory etc, you get to the stage where
you have a test message containing a bigram b = (word1, word2). Its
smoothed frequency will be foo[b]. If b is in bar, this should be
equal to bar[b] + 1. Otherwise it will be 1.

So:
(1) foo[b] == bar.get(b, 0) + 1
(2) foo is redundant. If you want to check that b is "legal", use
(word1 in W and word2 in W).
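
Points (1) and (2) together might look like the sketch below. It
assumes W is a set of words and bar maps (word1, word2) tuples to raw
training counts; the function name smoothed_count and the sample data
are hypothetical.

```python
def smoothed_count(bigram, bar, vocab):
    """Add-one smoothed training count for bigram, computed without foo."""
    word1, word2 = bigram
    # Legality check from point (2): both words must be in the vocabulary.
    if word1 not in vocab or word2 not in vocab:
        raise KeyError("bigram %r contains a word outside the vocabulary" % (bigram,))
    # Point (1): the actual count if the bigram was seen, else 0, plus 1.
    return bar.get(bigram, 0) + 1


# Hypothetical toy data.
W = {"the", "cat", "sat"}
bar = {("the", "cat"): 5, ("cat", "sat"): 2}

print(smoothed_count(("the", "cat"), bar, W))  # seen in training
print(smoothed_count(("sat", "the"), bar, W))  # legal but unseen
```

Only the bigrams actually seen in training are stored, so memory is
proportional to len(bar) rather than N**2.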

Please attempt to refute the specific points above, rather than
writing another essay :-)

Cheers,
John