Help optimize a script?

Terry Reedy tjreedy at home.com
Wed Oct 17 15:15:11 EDT 2001


"Joseph Santaniello" <someone at _no-spam_arbitrary.org> wrote in message
news:Pine.LNX.4.33.0110171040490.22682-100000 at harmony.arbitrary.org...
> Hello All,
>
> I have a simple script that I wrote to convert some fixed-width
> delimited files to tab-delimited.
>
> It works, but some of my files are over 100MB and it takes forever.
>
> First, does anyone know of a tool that does this so I don't have to
> reinvent the wheel, and barring that, can anyone offer some tips on
> how to optimize this code:
>
> indecies = { 'cob':[3,6,2,2,8,1,8], 'opend':[6,3,3,2,3,4,12,29] }
> # above is trimmed for this example
> # the lists in these dictionaries above are the widths of the
> # fields in the input files. The keys match the input file names
> # just to keep things readable.
>
>
> # while is used cuz line in readlines() used too much ram with
> # huge files.
> while 1:
>         line = sys.stdin.readline()
>         if not line:
>                 break
>         new = ''
>         start = 0
>         for index in indecies[sys.argv[1]]:
>                 new = new + strip(line[start:start + index])+'\t'
>                 start = start + index
>         print new
>
> So it reads in a line, then iterates over the list in the
> corresponding dictionary, and prints stripped substrings extracted
> according to the field widths in the list, printing a tab between
> each, then grabs a new line and does it again.
>
> Any suggestions on how to speed this up?

1. Build the new (output) line by appending to a list and then joining
with \t.

new = []
for whatever:
  new.append(field)
print '\t'.join(new)

This is linear rather than quadratic in the number of fields and
should speed things up by itself.
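As a runnable sketch of this idea in modern Python (str.strip as a
method rather than the old string.strip function; the widths are the
poster's 'cob' spec and the input line is invented for illustration):

```python
# Build each output line by collecting fields in a list and joining
# once at the end, instead of repeated string concatenation.
widths = [3, 6, 2, 2, 8, 1, 8]          # the 'cob' field widths
line = "AAAbbbbbbCCdd12345678Xzzzzzzzz"  # made-up 30-char sample line

fields = []
start = 0
for width in widths:
    fields.append(line[start:start + width].strip())
    start += width
out = "\t".join(fields)
```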

2. for line in file.xreadlines() may be faster if you are using a new
enough Python.  I believe it reads in large blocks and then peels off
lines.  Otherwise, try to find code on groups.google that does the
same; i.e., read in a megabyte (for instance), split it into lines,
deliver them one at a time, and paste the leftover piece onto the
beginning of the next block.  readline() is known to be slow.
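A sketch of that block-reading scheme (the helper name iter_lines is
made up for this example; in recent Pythons, plain iteration over the
file object already buffers like this for you):

```python
import io

def iter_lines(stream, blocksize=1 << 20):
    """Yield complete lines from stream, reading in large blocks and
    carrying any partial trailing line over to the next block."""
    leftover = ""
    while True:
        block = stream.read(blocksize)
        if not block:
            break
        block = leftover + block
        lines = block.split("\n")
        leftover = lines.pop()  # last piece may be an incomplete line
        for line in lines:
            yield line
    if leftover:
        yield leftover

# demo: a tiny block size forces the carry-over path to be exercised
data = io.StringIO("one\ntwo\nthree\n")
result = list(iter_lines(data, blocksize=4))
```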

3. If fields have spaces between them and none within, break up lines
with split().  Otherwise, preprocessing the index list (pl. indices or
maybe indexes, but not what you wrote) into a tuple of duples (pairs)
with start and stop for each field might or might not be faster.  I.e.,

fields = process([3,6,2,2,8,1,8])
# ((0,3),(3,9),(9,11),(11,13),(13,21),(21,22),(22,30))
for start,stop in fields:
   new.append(strip(line[start:stop]))
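One possible process() in modern Python, using itertools.accumulate
(the name process is just the placeholder from the sketch above):

```python
from itertools import accumulate

def process(widths):
    """Turn a list of field widths into (start, stop) slice pairs."""
    stops = list(accumulate(widths))     # running totals: 3, 9, 11, ...
    starts = [0] + stops[:-1]            # each field starts where the last stopped
    return tuple(zip(starts, stops))

fields = process([3, 6, 2, 2, 8, 1, 8])
```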

Even better, carry the preprocessing idea further and replace the
simple list of field widths with an explicit expression for the new
line's field sequence that unrolls the inner loop.  I.e.,

forms = {
'cob': lambda x: (strip(x[0:3]), strip(x[3:9]), ..., strip(x[22:30])),
...}

(This conversion can be done with a fairly simple function.)  Then,
your program reduces to

new = forms['cob']
for line in file.xreadlines():
  print '\t'.join(new(line))
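One fairly simple function that does that conversion, as a sketch: it
builds the unrolled lambda's source text and eval()s it (modern
Python, using the str.strip method; make_splitter is an invented name):

```python
def make_splitter(widths):
    """Build the unrolled field-extracting lambda from a width list,
    e.g. for [3, 6]: "lambda x: (x[0:3].strip(), x[3:9].strip(),)"."""
    parts = []
    start = 0
    for width in widths:
        parts.append(f"x[{start}:{start + width}].strip()")
        start += width
    return eval("lambda x: (" + ", ".join(parts) + ",)")

forms = {"cob": make_splitter([3, 6, 2, 2, 8, 1, 8])}
row = forms["cob"]("AAAbbbbbbCCdd12345678Xzzzzzzzz")
```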

If fields do not have spaces within, it might be faster to build lines
and delete spaces all at once with translate after joining.
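A sketch of that translate idea in modern Python (str.translate with a
deletion table from str.maketrans, replacing the old 256-character
table form; it assumes no field legitimately contains a space):

```python
# Join the raw, unstripped slices with tabs, then delete every space
# in a single pass instead of calling strip() per field.
raw = "AB \tC  \t D\n"                       # invented joined line
cleaned = raw.translate(str.maketrans("", "", " "))
```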

Good luck.

Terry J. Reedy

