gather information from various files efficiently

Paul McGuire ptmcg at austin.rr._bogus_.com
Wed Dec 15 09:20:42 EST 2004


"Klaus Neuner" <klaus_neuner82 at yahoo.de> wrote in message
news:3e96ebd7.0412140111.69244c7c at posting.google.com...
> Hello,
>
> I need to gather information that is contained in various files.
>
> Like so:
>
> file1:
> =====================
> foo : 1 2
> bar : 2 4
> baz : 3
> =====================
>
> file2:
> =====================
> foo : 5
> bar : 6
> baz : 7
> =====================
>
> file3:
> =====================
> foo : 4 18
> bar : 8
> =====================
>
>
> The straightforward way to solve this problem is to create a
> dictionary. Like so:
>
>
> [...]
>
> a, b = get_information(line)
> if a in dict.keys():
>     dict[a].append(b)
> else:
>     dict[a] = [b]
>
>
> Yet, I have got 43 such files. Together they are 4,1M
> large. In the future, they will probably become much larger.
> At the moment, the process takes several hours. As it is a process
> that I have to run very often, I would like it to be faster.
>
> How could the problem be solved more efficiently?
>
>
> Klaus

You have gotten a number of suggestions for speeding up the updates to your
global dictionary of values.  My business partner likens code optimization
to lowering the water in a river: performance bottlenecks stick out like
rocks.  Once you resolve one problem (remove the rock), you lower the water
level, and the next rock/bottleneck appears.  Have you looked at what is
happening in your get_information function?  If scanning these files still
takes a long time, that is the next place to look.  In working with my
pyparsing module, I've seen people scan multi-megabyte files in seconds, so
taking hours to sift through 4 MB of data suggests there are other problems
going on.
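
One way to find the next rock is to profile the loop rather than guess.  The
following is only a minimal sketch in current Python using the standard
cProfile and pstats modules.  The body of get_information here is a stand-in
(the original was never posted), and gather and profile_gather are
hypothetical helper names:

```python
import cProfile
import io
import pstats


def get_information(line):
    # Stand-in for the poster's function, whose real body was not shown.
    key, value = line.split(":", 1)
    return key.strip(), value.strip()


def gather(lines):
    # Same shape as the loop in the original post: one call per line.
    dct = {}
    for line in lines:
        a, b = get_information(line)
        dct.setdefault(a, []).append(b)
    return dct


def profile_gather(lines):
    # Run gather() under the profiler and return its result together
    # with a text report sorted by cumulative time.
    profiler = cProfile.Profile()
    result = profiler.runcall(gather, lines)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(10)
    return result, out.getvalue()


sample = ["foo : 1 2", "bar : 2 4", "baz : 3", "foo : 5"]
result, report = profile_gather(sample)
print(report)
```

If get_information dominates the cumulative-time column on the real data,
that is the rock to remove first.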

With input as clean as yours, something like:

    def get_information(line):
        return map(str.strip, line.split(":",1))

should do the trick.  For that matter, you could get rid of the function
call (function calls are relatively expensive in Python) and just inline it:

    a, b = map(str.strip, line.split(":", 1))
    if a in dct:
        dct[a] += b.split()
    else:
        dct[a] = b.split()

(I'm guessing you want to convert b values that have multiple numbers to a
list, based on your "dict[a] = [b]" source line.)
I also renamed dict to dct, per Fernando Perez's suggestion, so the name no
longer shadows the built-in dict type.
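
Putting the pieces together, here is a rough sketch of the whole gathering
step in current Python, assuming the files look like the three samples in
the original post.  The names gather and gather_files are hypothetical, and
skipping every line without a colon (the "=====" separators and blank lines)
is my assumption about the file format:

```python
def gather(lines, dct=None):
    # Accumulate "key : v1 v2 ..." lines into a dict of lists.
    if dct is None:
        dct = {}
    for line in lines:
        if ":" not in line:
            continue  # assumed: separators and blank lines carry no data
        a, b = (part.strip() for part in line.split(":", 1))
        if a in dct:
            dct[a] += b.split()
        else:
            dct[a] = b.split()
    return dct


def gather_files(paths):
    # Feed every file through gather(), sharing a single dictionary.
    dct = {}
    for path in paths:
        with open(path) as f:
            gather(f, dct)
    return dct


# The three sample files from the original post, as in-memory lines.
file1 = ["=====", "foo : 1 2", "bar : 2 4", "baz : 3", "====="]
file2 = ["=====", "foo : 5", "bar : 6", "baz : 7", "====="]
file3 = ["=====", "foo : 4 18", "bar : 8", "====="]

result = {}
for lines in (file1, file2, file3):
    gather(lines, result)
print(result)
```

The "a in dct" test is a hash lookup, so each line costs roughly constant
time no matter how many keys have been collected.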

-- Paul




