[Tutor] updating a dictionary

Fri Feb 20 11:59:36 CET 2015

Chris Stinemetz wrote:

> Here is a sample of the input data, it is tab delimited and I chopped it
> down for example purposes:
> 
> 
>  KSL03502_7A_1 11.5921
> KSL03502_7B_1 46.4997
> KSL03502_7C_1 13.5839
> KSL03505_7A_1 12.8684
> KSL03505_7B_1 16.5311
> KSL03505_7C_1 18.9926
> KSL03509_7A_1 3.4104
> KSL03509_7B_1 40.6244
> KSL03509_7C_1 51.0597
> KSL03511_7A_1 7.128
> KSL03511_7B_1 53.4401
> KSL03511_7C_1 66.2584
> KSL03514_2A_1 25.6476
> KSL03514_2B_1 53.17
> KSL03514_2C_1 11.6469
> KSL03514_7A_1 39.2292
> KSL03514_7B_1 65.675
> KSL03514_7C_1 3.4937
> 
> 
> I would like to parse it by using a dictionary structure, where each row
> would be something like:
> 
> name 7,8,9,2
> KSL03514_C,3.4937,,,11.6469
> KSL03514_B,65.675,,,53.17
> 
> I am just showing an example of what KSL03514_7C_1, KSL03514_2C_1,
> KSL03514_7B_1, KSL03514_2B_1 would parse.
> 
> Hope this helps explain what I am trying to accomplish.

You need to merge multiple input lines into one row dict, and you'll end up 
with several such rowdicts. The easiest way to keep them around is to put 
them into an outer dict that maps keys like "KSL03514_B" to the 
corresponding rowdict. The rowdict for that key will start as

{'2': '53.17', 'name': 'KSL03514_B'}

at line

> KSL03514_2B_1 53.17

and be updated to

{'7': '65.675', '2': '53.17', 'name': 'KSL03514_B'}

when line

> KSL03514_7B_1 65.675

is encountered. The "name" item is redundant because it's the same as the 
key in the outer dict

{'KSL03502_A': {'7': '11.5921', 'name': 'KSL03502_A'},
 'KSL03502_B': {'7': '46.4997', 'name': 'KSL03502_B'},
  ...
 'KSL03514_B': {'2': '53.17', '7': '65.675', 'name': 'KSL03514_B'},
 'KSL03514_C': {'2': '11.6469', '7': '3.4937', 'name': 'KSL03514_C'}}

but it simplifies generating the resulting file.
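
A minimal sketch of just that update step, using the two KSL03514 "B" lines 
from above. The sample triples are split by hand for illustration, and the 
names "samples" and "column" are my own choice:

```python
# Outer dict: maps a name such as "KSL03514_B" to its rowdict.
rows_by_name = {}

# (name, column, value) triples, already split out by hand for illustration.
samples = [
    ("KSL03514_B", "2", "53.17"),
    ("KSL03514_B", "7", "65.675"),
]

for name, column, value in samples:
    # setdefault() returns the rowdict already stored under name,
    # or stores the new {"name": name} dict and returns that.
    rowdict = rows_by_name.setdefault(name, {"name": name})
    rowdict[column] = value

print(rows_by_name)
```

Both lines end up in the same rowdict because setdefault() only creates a 
new dict the first time a name is seen.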

If you want to cheat, here's the code I came up with:

import csv
import operator
import sys
import logging

logger = logging.getLogger()

def read_data(infile):
    """Combine lines in infile with same <name> into one dict.

    Returns a sorted list of such dicts.

    Expected line format:
      <basename>_<prefix><suffix>_<don't care><whitespace><value><newline>
    where
      <prefix> digits only
      <suffix> non-digit followed by any non-"_"

    Then
      <name> = <basename>_<suffix>
    """
    # map <name> to rowdict
    # rowdict maps <prefix> to <value> and "name" to <name>
    rows_by_name = {}

    for line in infile:
        # key format:
        # <basename>_<prefix><suffix>_<don't care>
        key, value = line.split()

        basename, both, dummy = key.split("_")
        suffix = both.lstrip("0123456789")
        prefix = both[:len(both)-len(suffix)]
        name = basename + "_" + suffix
        rowdict = rows_by_name.setdefault(name, {"name": name})
        if prefix in rowdict:
            # we are going to overwrite a column value;
            # you may prefer to raise an exception instead
            logger.warning("duplicate column %s=%r for %s",
                           prefix, value, name)
        rowdict[prefix] = value

    return sorted(rows_by_name.values(), key=operator.itemgetter("name"))

def main():
    logging.basicConfig()

    with open("PRB_utilization.txt") as infile:
        rows = read_data(infile)

    writer = csv.DictWriter(
        sys.stdout,  # may replace stdout with any writable file object
        fieldnames=["name", "7", "8", "9", "2"]
    )
    writer.writeheader()
    writer.writerows(rows)

if __name__ == "__main__":
    main()
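
Rowdicts that lack some of the columns listed in fieldnames are not a 
problem: csv.DictWriter writes its restval (the empty string by default) 
for every missing key. A small sketch of that behaviour, assuming the two 
KSL03514 rowdicts from above and writing to an io.StringIO instead of 
sys.stdout so that the result can be inspected:

```python
import csv
import io

# Two rowdicts as read_data() would build them; "8" and "9" are missing.
rows = [
    {"name": "KSL03514_B", "7": "65.675", "2": "53.17"},
    {"name": "KSL03514_C", "7": "3.4937", "2": "11.6469"},
]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "7", "8", "9", "2"])
writer.writeheader()
writer.writerows(rows)  # missing keys are written as empty fields

print(out.getvalue())
```

A key that is not listed in fieldnames, on the other hand, makes DictWriter 
raise a ValueError unless you pass extrasaction="ignore".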