[Tutor] Building dictionary from large txt file

dn PythonList at DancesWithMice.info
Tue Jul 26 22:36:37 EDT 2022


On 27/07/2022 08.58, bobx ander wrote:
> Hi all,
> I'm trying to build a dictionary from a rather large file of following
> format after it has being read into a list(excerpt from start of list below)
> --------
> 
> Atomic Number = 1
>     Atomic Symbol = H
>     Mass Number = 1
>     Relative Atomic Mass = 1.00782503223(9)
>     Isotopic Composition = 0.999885(70)
>     Standard Atomic Weight = [1.00784,1.00811]
>     Notes = m
> --------
> 
> My goal is to extract the content into a dictionary that displays each
> unique triplet as indicated below
> {'H1': {'Z': 1,'A': 1,'m': 1.00782503223},
>               'D2': {'Z': 1,'A': 2,'m': 2.01410177812}
>                ...} etc
> My code that I have attempted is as follows:
> 
> filename='ex.txt'
> 
> afile=open(filename,'r') #opens the file
> content=afile.readlines()
> afile.close()
> isotope_data={'Z':0,'A':0,'m':0}#start to create subdictionary for
> each case of atoms with its unique keys and values
> for line in content:
>     data=line.strip().split()
> 
>     if len(data)<1:
>         pass
>     elif data[0]=="Atomic" and data[1]=="Number":
>         atomic_number=data[3]
> 
> 
>      elif data[0]=="Mass" and data[1]=="Number":
>         mass_number=data[3]
> 
> 
> 
>     elif data[0]=="Relative" and data[1]=="Atomic" and data[2]=="Mass":
>         relative_atomic_mass=data[4]
> 
> 
> isotope_data['Z']=atomic_number
> isotope_data['A']=mass_number
> isotope_data['A']=relative_atomic_mass
> isotope_data


+1 after @Alan: it is difficult to ascertain how the dictionary is to
be derived from the input file (not list!).

Because things are "not work[ing]", the code is evidently 'too complex'.
NB this is not an insult to your intelligence. It is, however, a
reflection of your programming and/or Python expertise/experience. The
(recommended) answer is to break down the total problem into smaller
units which you personally can 'see' and confidently manage. (The
right level of detail, or size of "chunk", is different for each of
us!)


Is the file *guaranteed* to have all seven lines per isotope (or
whatever we have to imagine it contains)?

Alternately, are some 'isotopes' described with fewer than seven lines
of data? In which case, each line must be read and 'understood' - plus
any missing data must be handled, presumably with some default value or
an indicator that such data is not available.
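
If records can be short, one way to handle the gaps might look like the
sketch below. It assumes one record has already been collected as a
list of 'label = value' strings; `parse_record`, `EXPECTED_FIELDS` and
`MISSING` are names invented for illustration, not anything from the
original post.

```python
# Hypothetical helper: turn the lines of ONE record into a dict,
# substituting a marker for any expected field that is absent.
EXPECTED_FIELDS = (
    "Atomic Number",
    "Atomic Symbol",
    "Mass Number",
    "Relative Atomic Mass",
)
MISSING = None  # assumed 'data not available' indicator


def parse_record(lines):
    """Return {label: text} for one record, with MISSING for absent labels."""
    found = {}
    for line in lines:
        label, separator, value = line.partition("=")
        if separator:  # ignore anything without an '=' sign
            found[label.strip()] = value.strip()
    return {label: found.get(label, MISSING) for label in EXPECTED_FIELDS}


record = parse_record(["Atomic Number = 1", "    Atomic Symbol = H"])
print(record["Atomic Symbol"])  # the text that followed '='
print(record["Mass Number"])    # MISSING, because that line was absent
```

Whether `None`, zero, or something else is the right 'not available'
marker depends on what is to be done with the dictionary afterwards.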


The first option seems more likely. Note how the problem was expressed
(first line of this reply, above), perhaps 'backwards': dictionary
first, input file second! It is easier to understand that way around -
and working the other way around is possibly a source of the problem
described.

So, here is a suggested approach - with the larger-problem broken-down
into smaller (and more-easily understood) units:-


The first component to code is a Python-generator which opens the file
(use a Context Manager if you want to be 'advanced'/an excuse to learn
such), reads a line, 'cleans' the data, and "yield"s the data-value;
'rinse and repeat'.
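
A minimal sketch of such a generator, assuming the 'cleaning' amounts to
stripping whitespace and skipping blank lines (the `path` parameter and
the throw-away demonstration file are my own additions):

```python
import os
import tempfile


def get_next_line(path):
    """Yield each non-blank line of the file at `path`, stripped of whitespace."""
    with open(path, "r") as source:  # context manager closes the file for us
        for raw_line in source:      # one line at a time, never the whole file
            cleaned = raw_line.strip()
            if cleaned:              # skip empty lines
                yield cleaned


# Demonstration with a throw-away file standing in for the real data file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Atomic Number = 1\n    Atomic Symbol = H\n\n    Mass Number = 1\n")

demo_lines = list(get_next_line(tmp.name))
os.remove(tmp.name)
print(demo_lines)
```

Each call to next() on the generator hands back one cleaned line; when
the file runs out, Python raises StopIteration on our behalf.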

Next, (up to) seven 'basic-functions', representing each of the
dictionary entries/lines in the data file. These will be very similar to
each-other, but each is solely-devoted to creating one dictionary entry
from the data generated by the generator. If they are called in the
correct sequence, each call will correspond to the next (expected)
record being read-in from the data-file.
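
As an illustration, two such 'basic-functions' might look like this.
The sketch assumes each function is handed an iterator of cleaned
'label = value' lines, and that the bracketed digits after the mass
(the "(9)") are to be dropped, as in the desired output; the label
check is my own addition.

```python
def get_atomic_number(lines):
    """Consume the next line and return the atomic number as an int."""
    label, _, value = next(lines).partition("=")
    if label.strip() != "Atomic Number":
        raise ValueError(f"expected 'Atomic Number', got {label.strip()!r}")
    return int(value)


def get_relative_atomic_mass(lines):
    """Consume the next line and return the mass as a float."""
    label, _, value = next(lines).partition("=")
    if label.strip() != "Relative Atomic Mass":
        raise ValueError(f"expected 'Relative Atomic Mass', got {label.strip()!r}")
    return float(value.split("(")[0])  # drop the trailing '(9)' part


# A list iterator stands in for the file-reading generator here.
lines = iter(["Atomic Number = 1", "Relative Atomic Mass = 1.00782503223(9)"])
z = get_atomic_number(lines)
m = get_relative_atomic_mass(lines)
print(z, m)
```

Because each function checks the label it expects, calling them out of
sequence fails loudly instead of quietly mis-filing the data.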

I'm assuming (in probable ignorance) that some data-items are
collated/collected together as nested dictionaries. In which case,
another 'level' of subroutine may be required - an 'assembly-function'.
This/these will call 'however many' of the above 'basic-functions' in
order to assemble a dictionary-entry which contains a dictionary as its
"value" (dicts are "key"-"value" pairs - in case you haven't met this
terminology before).

Those 'assembly-functions' will return that more complex dictionary
entry. We can now 'see' that the one-to-one relationship between a
function and a dictionary sub-structure is more important than any
one-to-one relationship with the input file! Thus, given that the
objective is to build "a dictionary" of "unique triplet[s]", each
function should return a sub-component of that 'isotope's' entry in the
dictionary - some return larger sub-components and others a single
value or key-value pair!
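
A sketch of one 'assembly-function', under the assumption that the mass
number and relative atomic mass belong together in one sub-dictionary
with the keys 'A' and 'm' from the desired output. `next_value` is a
hypothetical helper, and the list iterator again stands in for the
generator.

```python
def next_value(lines, expected_label):
    """Return the text after '=' on the next line, checking its label."""
    label, _, value = next(lines).partition("=")
    if label.strip() != expected_label:
        raise ValueError(f"expected {expected_label!r}, got {label.strip()!r}")
    return value.strip()


def get_mass_number(lines):
    return int(next_value(lines, "Mass Number"))


def get_relative_atomic_mass(lines):
    # keep only the digits before the bracketed '(9)' part
    return float(next_value(lines, "Relative Atomic Mass").split("(")[0])


def assemble_atomic_mass(lines):
    """Call the basic-functions in file order and return one sub-dict."""
    mass_number = get_mass_number(lines)
    relative_atomic_mass = get_relative_atomic_mass(lines)
    return {"A": mass_number, "m": relative_atomic_mass}


lines = iter(["Mass Number = 1", "Relative Atomic Mass = 1.00782503223(9)"])
sub_dict = assemble_atomic_mass(lines)
print(sub_dict)
```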

Finally then, the 'top level' is a loop-forever until the generator
runs out of data and raises its 'end of file' exception
(StopIteration). The loop calls each basic-function
or assembly-function in-turn, and either gradually or 'at the bottom of
each loop' assembles the dictionary-entry for that 'isotope' and adds it
to the dictionary.
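
Putting the layers together, the top level could be sketched as follows.
This assumes every record has exactly the seven lines shown in the
excerpt, in that order, and that the dictionary key is the symbol
followed by the mass number; `field` and `build_isotopes` are invented
names, and a list iterator stands in for the file-reading generator.

```python
SAMPLE = [  # the single record quoted in the original post
    "Atomic Number = 1",
    "Atomic Symbol = H",
    "Mass Number = 1",
    "Relative Atomic Mass = 1.00782503223(9)",
    "Isotopic Composition = 0.999885(70)",
    "Standard Atomic Weight = [1.00784,1.00811]",
    "Notes = m",
]


def field(lines, expected_label):
    """Return the text after '=' on the next line, checking its label."""
    label, _, value = next(lines).partition("=")
    if label.strip() != expected_label:
        raise ValueError(f"expected {expected_label!r}, got {label.strip()!r}")
    return value.strip()


def build_isotopes(lines):
    isotopes = {}  # init dict
    while True:
        try:
            z = int(field(lines, "Atomic Number"))
        except StopIteration:  # no more records: leave the loop
            break
        symbol = field(lines, "Atomic Symbol")
        a = int(field(lines, "Mass Number"))
        m = float(field(lines, "Relative Atomic Mass").split("(")[0])
        # read-and-discard the three lines not wanted in the triplet
        for unused in ("Isotopic Composition", "Standard Atomic Weight", "Notes"):
            field(lines, unused)
        isotopes[f"{symbol}{a}"] = {"Z": z, "A": a, "m": m}
    return isotopes


isotopes = build_isotopes(iter(SAMPLE))
print(isotopes)
```

Only the first call in each pass is wrapped in try...except: running
out of data there is the normal end, whereas running out part-way
through a record means a truncated file and ought to be noticed.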


Try a main-loop which looks something like:

# init dict

while "there's data":
  atomic_number = get_atomic_number()
  atomic_symbol = get_atomic_symbol()
  atomic_mass = assemble_atomic_mass()
  # etc
  assemble_dict_entry( atomic_number, atomic_symbol, atomic_mass, ... )

  # probably only need a try...except around the first call
  # which will break out of the while-loop

# dict is now fully assembled and ready for use...


# sample 'assembly-function'
def assemble_atomic_mass():
  # init sub-dict
  mass_number = get_mass_number()
  relative_atomic_mass = get_relative_atomic_mass()
  #etc
  # assemble sub-dict entry with atomic mass data
  sub_dict = {'A': mass_number, 'm': relative_atomic_mass}
  return sub_dict

# repeat above with function for each sub-dict/sub-collection of data

# which brings us to the individual data-items. These, it is implied,
# appear on separate lines of the data file, but in sets of seven
# data-lines (am ignoring lines of dashes, but if present, then
# eight-line sets). Accordingly:

def get_atomic_number():
  data_value = get_next_line()  # in real code: next() on the generator
  # whatever checks/processing
  atomic_number = int(data_value)
  return atomic_number

# and repeat for each of the seven data-items
# if necessary, add read-and-discard for line of dashes

# all the input functionality has been devolved to:

def get_next_line():
  # a Python generator which
  # opens the file
  # loop-forever
    # reads single line/record
    # (no need for more - indeed no point in reading the whole and then
    # having to break that down!)
    # strip, split, etc
    # yield data-value
  # until eof, when an exception (StopIteration) will be raised and
  # ripple 'up' the hierarchy of functions to the 'top-level'.


Here is another question: having assembled this dictionary, what will be
done with it? Always start at that back-end - we used to mutter the
mantra "input - process - output" and start 'backwards' (you've probably
already noted that!)


Another elegant feature is that each of the functions (starting from the
lowest level) can be developed and tested individually (or tested and
developed if you practice "TDD"). By testing that the generator returns
the data-file's records appropriately, the complexity of writing and
testing the next 'layer' of subroutine/function becomes easier - because
you will know that at least half of it 'already works'! Each (working)
small module can be built-upon and more-easily assembled into a working
whole - and if/when something 'goes wrong', it will most likely be
contained (only) within the newly-developed code!
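
By way of example, a test for one lowest-level function might be
sketched with the standard-library unittest module. The function under
test, `get_mass_number`, and its 'iterator of lines' argument are
assumptions made for this illustration.

```python
import unittest


def get_mass_number(lines):
    """Consume the next line and return the mass number as an int."""
    label, _, value = next(lines).partition("=")
    if label.strip() != "Mass Number":
        raise ValueError(f"expected 'Mass Number', got {label.strip()!r}")
    return int(value)


class GetMassNumberTest(unittest.TestCase):
    def test_reads_value(self):
        self.assertEqual(get_mass_number(iter(["Mass Number = 2"])), 2)

    def test_rejects_wrong_line(self):
        with self.assertRaises(ValueError):
            get_mass_number(iter(["Notes = m"]))

    def test_end_of_data(self):
        # an exhausted source must surface as StopIteration
        with self.assertRaises(StopIteration):
            get_mass_number(iter([]))


# Run the tests in-line; from a shell, "python -m unittest" does the same.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GetMassNumberTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Feeding the function a plain list iterator keeps the test independent
of the file and of the generator, so a failure points at this one
function and nowhere else.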

(of course, if a fault is found to be caused by 'lower level code' (draw
conclusion here), then, provided the tests have been retained, the test
for that lower-level can be expanded with the needed check, the tests
re-run, and one's attention allowed to rise back 'up' through the
layers...)

"Divide and conquer"!

-- 
Regards,
=dn

