[Tutor] Bigrams and nested dictionaries

Tue Apr 4 06:52:53 CEST 2006

My only comment is that this also counts spaces and punctuation (parentheses,
brackets, etc.), which I assume you don't want, since they have little to do
with natural language.  My suggestion would be to remove any punctuation or
whitespace keys from the dictionary after you've built it.  Unless, of
course, you're interested in which letters words start with, in which case
you could merge those keys into a single key?
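
Here is a rough sketch of both options, in case it helps. It is written for
Python 3 and assumes the table maps one-character strings to dictionaries of
one-character strings (not raw bytes). The names UNWANTED, drop_unwanted and
merge_unwanted are made up for illustration, and dropping the unwanted
characters from the inner dictionaries as well is my own assumption about
what you would want.

```python
import string

# Characters treated as "not part of the language" (an assumption; adjust as needed)
UNWANTED = set(string.punctuation) | set(string.whitespace)

def drop_unwanted(bigrams):
    """Return a copy with punctuation/whitespace removed as left or right character."""
    cleaned = {}
    for left, followers in bigrams.items():
        if left in UNWANTED:
            continue  # skip the whole entry for this key
        kept = {r: n for r, n in followers.items() if r not in UNWANTED}
        if kept:
            cleaned[left] = kept
    return cleaned

def merge_unwanted(bigrams, merged_key=" "):
    """Return a copy with every punctuation/whitespace key folded into merged_key."""
    merged = {}
    for left, followers in bigrams.items():
        key = merged_key if left in UNWANTED else left
        target = merged.setdefault(key, {})
        for r, n in followers.items():
            target[r] = target.get(r, 0) + n  # add counts when keys collide
    return merged
```

With the merged version, the entry under merged_key gives a count of the
characters that follow any space or punctuation mark, which is roughly the
"what letter do words start with" question.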

Cheers,
Orri

On 4/4/06, Michael Broe <mbroe at columbus.rr.com> wrote:
>
> Well coming up with this has made me really love Python. I worked on
> this with my online pythonpenpal Kyle, and here is what we came up
> with. Thanks to all for input so far.
>
> My first idea was to use a C-type indexing for-loop, to grab a two-
> element sequence [i, i+1]:
>
> dict = {}
> for i in range(len(t) - 1):
>         if not dict.has_key(t[i]):
>                 dict[t[i]] = {}
>         if not dict[t[i]].has_key(t[i+1]):
>                 dict[t[i]][t[i+1]] = 1
>         else:
>                 dict[t[i]][t[i+1]] += 1
>
> Which works. But Kyle had an alternative take, with no indexing, and
> after we worked on this strategy it seemed very Pythonesque, and ran
> almost twice as fast.
>
> ----
>
> #!/usr/local/bin/python
>
> import sys
> file = open(sys.argv[1], 'rb').read()
>
> # We imagine a 2-byte 'window' moving over the text from left to right
> #
> #          +-------+
> # L  o  n  | d   o |  n  .  M  i  c  h  a  e  l  m  a  s      t  e  r  m ...
> #          +-------+
> #
> # At any given point, we call the leftmost byte visible in the window 'L',
> # and the rightmost byte 'R'.
> #
> #          +-----------+
> # L  o  n  | L=d   R=o |  n  .  M  i  c  h  a  e  l  m  a  s      t  e  r  m ...
> #          +-----------+
> #
> # When the program begins, the first byte is preloaded into L, and we
> # position R at the second byte of the file.
> #
>
> dict = {}
>
> L = file[0]
> for R in file[1:]:      # move right edge of window across the file
>         if not L in dict:
>                 dict[L] = {}
>
>         if not R in dict[L]:
>                 dict[L][R] = 1
>         else:
>                 dict[L][R] += 1
>
>         L = R           # move character in R over to L
>
> # that's it. here's a printout strategy:
>
> for entry in dict:
>         print entry, ':', sum(dict[entry].values())
>         print dict[entry]
>         print
>
> ----
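
For anyone reading this on Python 3, the same sliding-window idea can be
sketched with zip(), which pairs each character with its right-hand
neighbour, so no index and no explicit L = R step is needed. This is not the
code from the message above, only an illustration; the function name
count_bigrams is hypothetical, and it works on a text string rather than on
bytes read with 'rb'.

```python
from collections import Counter, defaultdict

def count_bigrams(text):
    """Map each character to a Counter of the characters that follow it."""
    table = defaultdict(Counter)
    # zip(text, text[1:]) yields (left, right) pairs: the two-character window
    for left, right in zip(text, text[1:]):
        table[left][right] += 1
    return table

if __name__ == "__main__":
    counts = count_bigrams("banana")
    for left, followers in counts.items():
        # total number of bigrams starting with this character, then the detail
        print(left, ":", sum(followers.values()))
        print(dict(followers))
        print()
```

defaultdict(Counter) removes the two membership tests: a missing outer key
gets an empty Counter, and a missing inner key counts as zero.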

--
Email: singingxduck AT gmail DOT com
AIM: singingxduck
Programming Python for the fun of it.