My Big Dict.

Paul Simmonds psimmo60 at hotmail.com
Wed Jul 2 09:32:09 EDT 2003


"Aurélien Géron" <ageron at HOHOHOHOvideotron.ca> wrote in message news:<bdua4i$18el$1 at biggoron.nerim.net>...
> "drs" wrote...
> > "Christophe Delord" <christophe.delord at free.fr> wrote in message
> > news:20030702073735.40293ba2.christophe.delord at free.fr...
> > > Hello,
> > >
> > > On Wed, 2 Jul 2003 00:13:26 -0400, Xavier wrote:
<snip>
> > > > I need advice on how I can convert a text db into a dict.  Here is an
> > > > example of what I need done.
> > > >
> > > > some example data lines in the text db goes as follows:
> > > >
> > > > CODE1!DATA1 DATA2, DATA3
> > > > CODE2!DATA1, DATA2 DATA3
<snip>
> > > > Any idea on how I can convert 20,000+ lines of the above into the
> > > > following protocol for use in my code?:
> > > >
> > > > TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}
> > > >
> > >
> > > If your data is in a string you can use a regular expression to parse
> > > each line, then the findall method returns a list of tuples containing
> > > the key and the value of each item. Finally the dict class can turn this
> > > list into a dict. For example:
<example snipped>
> >
> > and you can kill a fly with a sledgehammer.  why not
> >
> > f = open('somefile.txt')
> > d = {}
> > l = f.readlines()
> > for i in l:
> >     a,b = i.split('!')
> >     d[a] = b.strip()
<snip>
> Your code looks good Christophe.  Just two little things to be aware of:

I think I'm right in saying that Christophe's approach (now snipped)
used the 're' module, whereas the split-based code quoted above came
from "drs".
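
For reference, the 're' idea looks roughly like this (a sketch of the
approach Christophe described, not his snipped code):

import re

# one (key, value) tuple per line containing a '!'; the non-greedy group
# splits at the first '!', and dict() turns the list of pairs into a mapping
pattern = re.compile(r'^(.*?)!(.*)$', re.MULTILINE)
d = dict(pattern.findall(open('test.txt').read()))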

> 1) if you use split like this, then each line must contain one and only one
> '!', which means (in particular) that empty lines will bomb, and also data
> must not contain any '!' or else you'll get an exception such as
> "ValueError: unpack list of wrong size".   If your data may contain '!',
> then consider slicing up each line in a different way.

If this is a problem, use the count and index string methods: count to
skip lines with no '!', index to find the first one, and slices to split
the line there. For example, if you don't mind a two-line list comp:

# keep lines containing '!'; the -1 slice drops the trailing newline
d = dict([(l[:l.index('!')], l[l.index('!') + 1:-1])
          for l in file('test.txt') if l.count('!')])
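
With the two sample lines from the original post this gives
{'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}, which is
the format Xavier asked for.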

> 2) if your file is really huge, then you may want to fill up your dictionary
> as you're reading the file, instead of reading everything in a list and then
> building your dictionary (hence using up twice the memory).
Agreed.

The above list comprehension has two disadvantages: it calls count on
every line (and index twice on each line it keeps), and it builds the
whole list of (key, value) pairs in memory before dict() ever sees it.
Assuming most lines are data lines, so the exception is rare, this is
much faster:

d = {}
for l in file("test.txt"):
    try:
        i = l.index('!')       # position of the first '!'
    except ValueError:
        continue               # no '!' on this line (e.g. a blank line)
    d[l[:i]] = l[i + 1:].rstrip('\n')

It's often much faster to ask forgiveness than permission: I measured
this at about twice the speed of the 're' method, and about four times
the speed of the list comp above.
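
If you want to time it on your own data, a rough harness along these
lines will do (the file name and the wrapper function are just
placeholders):

import time

def load(path):
    # the same try/except loop as above, wrapped in a function for timing
    d = {}
    for l in file(path):
        try:
            i = l.index('!')
        except ValueError:
            continue
        d[l[:i]] = l[i + 1:].rstrip('\n')
    return d

start = time.clock()
d = load('test.txt')
print "%d entries loaded in %.3f seconds" % (len(d), time.clock() - start)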
HTH,
Paul

> 
> But apart from these details, I agree with Christophe that this is the way
> to go.
> 
> Aurélien



