efficient data loading with Python, is that possible?

DouhetSukd DouhetSukd at gmail.com
Wed Dec 12 22:04:41 EST 2007


About 8 years ago, on PC hardware, I was reading two 5 MB files and
doing a 'fancy' diff between them in about 60 seconds.  Granted, your
file is likely bigger, but modern hardware is also faster, and 20
minutes does seem a bit high.

I can't speak to the rest of your code, but some parts of it could be
optimized.  Take this one:

def parseValue(line, col):
    s = line[col.start:col.end+1]
    # no switch in python
    if col.format == ColumnFormat.DATE:
        return Format.parseDate(s)
    if col.format == ColumnFormat.UNSIGNED:
        return Format.parseUnsigned(s)

How about taking the big if clause out?  That would require making all
the formatters into functions, rather than in-lining some of them, but
it may clean things up.

# Prebuild a lookup of functions keyed by expected format.
# This is done once.
# Remember, you have to position this dict's computation _after_ all
# the Format.parseXXX declarations.  Don't worry, Python _will_ complain
# if you don't.

dict_format_func = {ColumnFormat.DATE:Format.parseDate,
                    ColumnFormat.UNSIGNED:Format.parseUnsigned,
                    ....

def parseValue(line, col):
    s = line[col.start:col.end+1]

    #get applicable function, apply it to s
    return dict_format_func[col.format](s)
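
To make the lookup-table idea concrete, here is a rough, self-contained
sketch.  ColumnFormat, Column and the two parser functions are
hypothetical stand-ins, since I haven't seen your real definitions; only
the dispatch pattern matters.  It is written for a current Python 3.

```python
from collections import namedtuple
from time import mktime, strptime

# Hypothetical stand-ins for the original poster's classes.
class ColumnFormat:
    DATE = 0
    UNSIGNED = 1

Column = namedtuple("Column", "start end format")

def parse_date(s):
    # Seconds since the epoch for a yymmdd string, in local time.
    return int(mktime(strptime(s, "%y%m%d")))

def parse_unsigned(s):
    return int(s)

# Built once, after the parser functions exist.
FORMAT_FUNCS = {
    ColumnFormat.DATE: parse_date,
    ColumnFormat.UNSIGNED: parse_unsigned,
}

def parse_value(line, col):
    # col.end is inclusive, as in the original code.
    s = line[col.start:col.end + 1]
    # Fetch the applicable function and apply it to the slice.
    return FORMAT_FUNCS[col.format](s)

line = "071212000042"
date_col = Column(0, 5, ColumnFormat.DATE)
count_col = Column(6, 11, ColumnFormat.UNSIGNED)
print(parse_value(line, count_col))
```

An unknown format shows up as a KeyError at the lookup, which is
arguably easier to spot than an if chain that silently falls through and
returns None.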

Also...

     if col.format == ColumnFormat.STRING:
        # and-or trick (no x ? y:z in python 2.4)
        return not col.strip and s or rstrip(s)

Watch out!  If 'col.strip' is a method rather than a flag you set
yourself, then it is not the result of stripping the column, it is the
strip _function_ itself, bound to the col object, so it will always be
true.  I get caught by those things all the time :-(
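
Here is a small sketch of that pitfall.  It assumes 'strip' is a method
on the column object; the Column class below is made up for
illustration, and if your 'col.strip' is really a plain boolean
attribute, your original line behaves as intended.  The conditional
expression at the end needs Python 2.5 or later, so it is not an option
on 2.4.

```python
class Column:
    # Hypothetical column class with a method that happens to be named strip.
    def __init__(self, name):
        self.name = name

    def strip(self):
        return self.name.strip()

col = Column("  id  ")

# A bound method is an object, and objects are truthy by default,
# so testing it without calling it tells you nothing.
method_is_truthy = bool(col.strip)
call_result = col.strip()

def string_value(s, strip_flag):
    # Conditional expression instead of the and-or trick.
    return s.rstrip() if strip_flag else s

print(method_is_truthy, repr(call_result))
```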

I agree that taking out the dot.dot.dots would help, but I wouldn't
expect it to matter that much, unless it was in an incredibly tight
loop.

It might be that.
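
For what it's worth, taking out the dots usually means binding the
dotted name to a local variable before the loop starts.  A rough sketch,
with a made-up Format class standing in for yours:

```python
class Format:
    # Hypothetical stand-in for the poster's Format class.
    @staticmethod
    def parseUnsigned(s):
        return int(s)

def load_dotted(fields):
    out = []
    for s in fields:
        # Format.parseUnsigned and out.append are looked up on every pass.
        out.append(Format.parseUnsigned(s))
    return out

def load_hoisted(fields):
    out = []
    parse = Format.parseUnsigned  # looked up once, before the loop
    append = out.append
    for s in fields:
        append(parse(s))
    return out

fields = ["0001", "0042", "0100"]
print(load_hoisted(fields))
```

Both functions return the same list; whether the hoisted version is
noticeably faster depends on how tight the loop really is, so it is
worth timing on your own data.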

     if s.startswith('999999') or s.startswith('000000'): return -1

     would be better as...

from time import mktime, strptime

# Outside of the loop, define the set of values for which you want to
# return -1.
set_return = set(['999999', '000000'])

# Look up the first 6 chars in your set.
def parseDate(s):
    if s[0:6] in set_return:
        return -1
    return int(mktime(strptime(s, "%y%m%d")))

Bottom line:  Python built-in data objects, such as dictionaries and
sets, are very much optimized.  Relying on them, rather than writing a
lot of ifs and doing weird data structure manipulations in Python
itself, is a good approach to try.  Try to build those objects outside
of your main processing loops.
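
If you want to check this on your own machine rather than take my word
for it, a timing harness along these lines might help.  The format
codes and handlers are invented for the example, and the numbers will
vary; with only a handful of branches the gap may be small, or even go
the other way.

```python
import timeit

DATE, UNSIGNED, STRING = range(3)  # hypothetical format codes

def parse_if_chain(fmt, s):
    if fmt == DATE:
        return s[:6]
    if fmt == UNSIGNED:
        return int(s)
    if fmt == STRING:
        return s.rstrip()
    raise ValueError(fmt)

# Built once, outside any processing loop.
HANDLERS = {DATE: lambda s: s[:6], UNSIGNED: int, STRING: str.rstrip}

def parse_dispatch(fmt, s):
    return HANDLERS[fmt](s)

samples = [(DATE, "071212"), (UNSIGNED, "000042"), (STRING, "abc   ")]

def run(parse):
    return [parse(fmt, s) for fmt, s in samples]

if __name__ == "__main__":
    for fn in (parse_if_chain, parse_dispatch):
        secs = timeit.timeit(lambda: run(fn), number=100000)
        print(fn.__name__, round(secs, 3))
```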

Cheers

Douhet-did-suck



