efficient data loading with Python, is that possible?

igor.tatarinov at gmail.com
Wed Dec 12 19:44:01 EST 2007


On Dec 12, 4:03 pm, John Machin <sjmac... at lexicon.net> wrote:
> Inside your function
> [you are doing all this inside a function, not at global level in a
> script, aren't you?], do this:
>     from time import mktime, strptime # do this ONCE
>     ...
>     blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))
>
> It would help if you told us what platform, what version of Python,
> how much memory, how much swap space, ...
>
> Cheers,
> John

I am using a global 'from time import ...'. I will try to do that
within the function and see if it makes a difference.
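
For reference, the variant John describes would look roughly like the
sketch below. The function name convert_all is hypothetical, and the
format string is the one from his message. In CPython a name bound
inside a function is a local, and local lookups are cheaper than global
ones, but the saving is probably small next to the cost of strptime
itself.

```python
def convert_all(stamps):
    # The import runs once per call of convert_all, not once per item,
    # and binds mktime/strptime as local names for the loop below.
    from time import mktime, strptime

    out = []
    for s in stamps:
        out.append(int(mktime(strptime(s, "%m%d%y%H%M%S"))))
    return out
```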

The machine I am using has 8 GB of RAM. It runs Linux on a dual-core
AMD or something like that. The Python version is 2.4.

Here is some of my code. Tell me what's wrong with it :)

from os import popen  # assumed: the import is not shown in this excerpt

def loadFile(inputFile, loader):
    # .zip files don't work with zlib
    f = popen('zcat ' + inputFile)
    for line in f:
        loader.handleLine(line)
    ...
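
A possible alternative to piping through zcat is the standard gzip
module, which reads .gz files directly (it does not read .zip archives).
This is only a sketch: load_file_gzip is a hypothetical name, and it is
written for current Python, since 2.4 has no 'with' statement and no
text mode for gzip.open. Whether it is faster than zcat would have to be
measured. It does avoid building a shell command from the file name.

```python
import gzip

def load_file_gzip(input_file, loader):
    # "rt" decompresses and decodes to text, yielding one line at a time
    with gzip.open(input_file, "rt") as f:
        for line in f:
            loader.handleLine(line)
```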

In Loader class:
def handleLine(self, line):
    # filter out 'wrong' lines
    if not self._dataFormat(line): return

    # add a new output record
    rec = self.result.addRecord()

    for col in self._dataFormat.colFormats:
        value = parseValue(line, col)
        rec[col.attr] = value

And here is parseValue (will using a hash-based dispatch make it much
faster?):

def parseValue(line, col):
    s = line[col.start:col.end+1]
    # no switch in python
    if col.format == ColumnFormat.DATE:
        return Format.parseDate(s)
    if col.format == ColumnFormat.UNSIGNED:
        return Format.parseUnsigned(s)
    if col.format == ColumnFormat.STRING:
        # and-or trick (no x ? y:z in python 2.4)
        return not col.strip and s or s.rstrip()
    if col.format == ColumnFormat.BOOLEAN:
        return s == col.arg and 'Y' or 'N'
    if col.format == ColumnFormat.PRICE:
        return Format.parseUnsigned(s)/100.
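
The hash-based dispatch asked about above could be sketched as follows.
ColumnFormat, Col and the parse_* helpers here are stand-ins invented for
the example, since the real classes are not shown, and parse_unsigned
assumes that Format.parseUnsigned is essentially int(). One dictionary
lookup replaces the chain of comparisons.

```python
from time import mktime, strptime

class ColumnFormat:
    # stand-in constants; the real ColumnFormat is not shown in the post
    DATE, UNSIGNED, STRING, BOOLEAN, PRICE = range(5)

class Col:
    # hypothetical column descriptor with the attributes parseValue uses
    def __init__(self, start, end, format, strip=False, arg=None):
        self.start, self.end, self.format = start, end, format
        self.strip, self.arg = strip, arg

def parse_date(s, col):
    # same logic as Format.parseDate in the post
    if s.startswith('999999') or s.startswith('000000'):
        return -1
    return int(mktime(strptime(s, "%y%m%d")))

def parse_unsigned(s, col):
    return int(s)  # assumption about what Format.parseUnsigned does

def parse_string(s, col):
    if col.strip:
        return s.rstrip()
    return s

def parse_boolean(s, col):
    if s == col.arg:
        return 'Y'
    return 'N'

def parse_price(s, col):
    return int(s) / 100.0

# format code -> parser; every parser takes (field_text, column)
PARSERS = {
    ColumnFormat.DATE: parse_date,
    ColumnFormat.UNSIGNED: parse_unsigned,
    ColumnFormat.STRING: parse_string,
    ColumnFormat.BOOLEAN: parse_boolean,
    ColumnFormat.PRICE: parse_price,
}

def parse_value(line, col):
    s = line[col.start:col.end + 1]
    return PARSERS[col.format](s, col)
```

The parser could also be looked up once per column and stored on the
column object, so that nothing is dispatched per line. Either way the
gain may be modest if most of the time goes into strptime.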

And here is Format.parseDate() as an example:
def parseDate(s):
    # missing (infinite) value ?
    if s.startswith('999999') or s.startswith('000000'): return -1
    return int(mktime(strptime(s, "%y%m%d")))

Hopefully, this should be enough to tell what's wrong with my code.

Thanks again,
igor


