Dict to "flat" list of (key,value)

Bengt Richter bokr at oz.net
Sun Aug 3 01:17:40 EDT 2003


On Sun, 03 Aug 2003 00:28:53 GMT, sjmachin at lexicon.net (John Machin) wrote:

>On Sat, 02 Aug 2003 12:41:19 GMT, "Raymond Hettinger"
><vze4rx4y at verizon.net> wrote:
>
>
>>it-all-starts-with-a-good-data-structure-ly yours,
>
>Amen, brother. 
>
>Even worse: I recall seeing code somewhere that had a data structure
>like this:
>
>{k1: [v1,v2], k2: v3, k3: None, ...}
>instead of 
>{k1: [v1,v2], k2: [v3], k3: [], ...}
>
Page 1

>I liked the elegant code example for building a book index. However in
>practice the user requirement would be not to have duplicated page
>numbers when a word occurs more than once on the same page. If you can
>achieve that elegantly, please post it!
>
Page 2

Don't know about elegance, but I wanted to play with my new 2.3 version ;-)

Assuming the Timbot has special-cased small sets efficiently, and
assuming pageSeq below is a generator can yield a sequence of word-iterable
page objects (e.g. lists of words left after cleaning out punctuation
etc of a page, however page boundaries are detected, etc etc):

(FTHOI, the following pageSeq generator expects to find a "Page xxx"
line ending each page, with nothing else on the line). I'm going to use
this post as a "book" ;-)

Page 3


====< bookindex.py >=================================================

keepalnum = ''.join([c.isalnum() and c or ' ' for c in [chr(i) for i in range(256)]])

def pageSeq(bookFile): # expect open bookFile ready to read
    pageWords = []
    pageId = '<prologue>'
    while 1:
        line = bookFile.readline()
        if not line: break
        lineToks = line.translate(keepalnum).split()
        if not lineToks: continue
        # expect single footer line with only "Page pgid" at end of page
        if len(lineToks)==2 and lineToks[0]=='Page':
            pageId = lineToks[1]
            yield pageId, pageWords
            pageWords = []; pageId = '(after %s)'% pageId
        else: pageWords.extend([tok for tok in lineToks if tok.isalpha()]) # rej digits in words
    if pageWords:
        yield pageId, pageWords

def bookIndex(bookFile):
    # (untested, requires 2.3 features)
    from sets import Set
    wordIndex = {}
    for pgId, page in pageSeq(bookFile):
        for word in page:
            try:
                wordIndex[word].add(pgId)
            except KeyError:
                wordIndex[word] = Set([pgId])

    words = wordIndex.keys()
    words.sort()
    for word in words:
        pageIds = list(wordIndex[word])
        pnMaxWidth = max(map(len,pageIds))
        pageIds = [pid.rjust(pnMaxWidth) for pid in pageIds]
        pageIds.sort()
        pageIds = map(str.strip, pageIds)
        yield (word, ', '.join(pageIds))

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    if args:
        arg = args.pop(0)
        if arg == '-': f = sys.stdin
        else: f = file(arg)
        iCol=0
        for wordOccLine in bookIndex(f):
            print ('%14s: %-10s' % wordOccLine),
            if iCol%3 == 2: print
            iCol += 1
        print
    else:
        raise SystemExit(
            'Usage: python bookindex.py bookfilepath\n'
            '   a single minus for bookfilepath uses sys.stdin')
=====================================================================
Page CODE

I started to post just the bookIndex generator (still marked untested, which is
almost right still ;-) Then one think led to another ;-)

Page LAST

I'll copy above this line to the clip board and filter it through. Result:

[22:22] C:\pywk\clp>getclip | python bookindex.py - |more
          Amen: 1                Assuming: 3                     Aug: 1
           Don: 3                    Even: 1                   FTHOI: 3
           GMT: 1               Hettinger: 1                 However: 2
             I: 1, 2, 3, LAST             If: 2                    John: 1
      KeyError: CODE               Machin: 1                    None: 1
            On: 1                    Page: 3, CODE           Raymond: 1
           Sat: 1                     Set: CODE                  Sun: 1
    SystemExit: CODE                 Then: LAST               Timbot: 3
         Usage: CODE                    a: 1, 2, 3, CODE          about: 3
       achieve: 2                     add: CODE                after: 3, CODE
           all: 1                  almost: LAST                  and: 3, CODE
       another: LAST                  are: 3                     arg: CODE
          args: CODE                 argv: CODE                   as: 3
      assuming: 3                      at: CODE                   be: 2
         below: 3                    book: 2, 3             bookFile: CODE
     bookIndex: CODE, LAST   bookfilepath: CODE            bookindex: CODE
    boundaries: 3                   break: CODE              brother: 1
      building: 2                     but: 3                       c: CODE
           can: 2, 3                cased: 3                     chr: CODE
      cleaning: 3                    code: 1, 2             continue: CODE
          data: 1                     def: CODE             detected: 3
        digits: CODE           duplicated: 2                       e: 3
          each: 3             efficiently: 3                elegance: 3
       elegant: 2               elegantly: 2                    else: 3, CODE
           end: CODE               ending: 3                     etc: 3
       example: 2                  except: CODE               expect: CODE
       expects: 3                  extend: CODE                    f: CODE
      features: CODE                 file: CODE                 find: 3
     following: 3                  footer: CODE                  for: 2, CODE
          from: CODE                    g: 3               generator: 3, LAST
         going: 3                    good: 1                     had: 1
           has: 3                    have: 2                 however: 3
             i: CODE                 iCol: CODE                   if: CODE
        import: CODE                   in: 2, CODE             index: 2
       instead: 1                      is: 3, LAST           isalnum: CODE
       isalpha: CODE                   it: 1, 2             iterable: 3
          join: CODE                 just: LAST            keepalnum: CODE
          keys: CODE                 know: 3                     led: LAST
          left: 3                     len: CODE              lexicon: 1
          like: 1                   liked: 2                    line: 3, CODE
      lineToks: CODE                 list: CODE                lists: 3
            ly: 1                       m: 3                    main: CODE
           map: CODE               marked: LAST                  max: CODE
         minus: CODE                 more: 2                      my: 3
             n: CODE                 name: CODE                  net: 1
           new: 3                     not: 2, CODE           nothing: 3
       numbers: 2                 objects: 3                  occurs: 2
            of: 1, 3, CODE             on: 2, 3                 once: 2
           one: LAST                 only: CODE                 open: CODE
            or: CODE                  out: 3                    page: 2, 3, CODE
        pageId: CODE              pageIds: CODE              pageSeq: 3, CODE
     pageWords: CODE                 pgId: CODE                 pgid: CODE
           pid: CODE                 play: 3                  please: 2
    pnMaxWidth: CODE                  pop: CODE                 post: 2, 3, LAST
      practice: 2                   print: CODE             prologue: CODE
   punctuation: 3                      py: CODE               python: CODE
         raise: CODE                range: CODE                 read: CODE
      readline: CODE                ready: CODE               recall: 1
           rej: CODE          requirement: 2                requires: CODE
         right: LAST                rjust: CODE                    s: CODE
          same: 2                  seeing: 1                sequence: 3
          sets: 3, CODE            single: CODE             sjmachin: 1
         small: 3               somewhere: 1                    sort: CODE
       special: 3                   split: CODE              started: LAST
        starts: 1                   stdin: CODE                still: LAST
           str: CODE                strip: CODE            structure: 1
           sys: CODE                    t: 3                    than: 2
          that: 1, 2                  the: 2, 3, LAST          think: LAST
          this: 1, 3                   to: 2, 3, CODE, LAST            tok: CODE
     translate: CODE                  try: CODE             untested: CODE, LAST
           use: 3                    user: 2                    uses: CODE
       verizon: 1                 version: 3                  wanted: 3
          when: 2                   which: LAST                while: CODE
          with: 1, 3, CODE           word: 2, 3, CODE      wordIndex: CODE
   wordOccLine: CODE                words: 3, CODE             worse: 1
         would: 2                   wrote: 1                     xxx: 3
         yield: 3, CODE               you: 2                   yours: 1



Regards,
Bengt Richter




More information about the Python-list mailing list