efficient data loading with Python, is that possible?

Thu Dec 13 10:09:05 EST 2007

On 2007-12-13, igor.tatarinov at gmail.com <igor.tatarinov at gmail.com> wrote:
> On Dec 12, 4:03 pm, John Machin <sjmac... at lexicon.net> wrote:
>> Inside your function
>> [you are doing all this inside a function, not at global level in a
>> script, aren't you?], do this:
>>     from time import mktime, strptime # do this ONCE
>>     ...
>>     blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))
>>
>> It would help if you told us what platform, what version of Python,
>> how much memory, how much swap space, ...
>>
>> Cheers,
>> John
>
> I am using a global 'from time import ...'. I will try to do that
> within the function and see if it makes a difference.
>
> The computer I am using has 8G of RAM. It's a Linux dual-core AMD or
> something like that. Python 2.4
>
> Here is some of my code. Tell me what's wrong with it :)
>
> def loadFile(inputFile, loader):
>     # .zip files don't work with zlib
>     f = popen('zcat ' + inputFile)
>     for line in f:
>         loader.handleLine(line)
>     ...
>
> In Loader class:
> def handleLine(self, line):
>     # filter out 'wrong' lines
>     if not self._dataFormat(line): return
>
>     # add a new output record
>     rec = self.result.addRecord()
>
>     for col in self._dataFormat.colFormats:
>         value = parseValue(line, col)
>         rec[col.attr] = value
>
> def parseValue(line, col):
>     s = line[col.start:col.end+1]
>     # no switch in python
>     if col.format == ColumnFormat.DATE:
>         return Format.parseDate(s)
>     if col.format == ColumnFormat.UNSIGNED:
>         return Format.parseUnsigned(s)
>     if col.format == ColumnFormat.STRING:
>         # and-or trick (no x ? y:z in python 2.4)
>         return not col.strip and s or rstrip(s)
>     if col.format == ColumnFormat.BOOLEAN:
>         return s == col.arg and 'Y' or 'N'
>     if col.format == ColumnFormat.PRICE:
>         return Format.parseUnsigned(s)/100.
>
> And here is Format.parseDate() as an example:
> def parseDate(s):
>     # missing (infinite) value ?
>     if s.startswith('999999') or s.startswith('000000'): return -1
>     return int(mktime(strptime(s, "%y%m%d")))

An inefficient parsing technique is probably to blame. You first
inspect the line to make sure it is valid, then for each column
you test up to (number of column types) conditions to discover
what data type it contains, and then you inspect it *again* to
finally translate it.

> And here is parseValue (will using a hash-based dispatch make
> it much faster?):

Not much.
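
For reference, a hash-based dispatch might look roughly like the
sketch below. The format tags and function names are made up for
illustration (they stand in for your ColumnFormat constants and
Format methods), and it is written for a current Python rather
than 2.4. It replaces the chain of if-tests with one dictionary
lookup, but each value is still sliced and converted separately,
which is why I would not expect a big win from it alone.

```python
from time import mktime, strptime

# Hypothetical format tags standing in for the ColumnFormat constants.
DATE, UNSIGNED, PRICE = "date", "unsigned", "price"

def parse_date(s):
    # Same sentinel handling as the quoted parseDate().
    if s.startswith("999999") or s.startswith("000000"):
        return -1
    return int(mktime(strptime(s, "%y%m%d")))

def parse_unsigned(s):
    return int(s)

def parse_price(s):
    # Stored as an integer number of cents.
    return int(s) / 100.0

# One dictionary lookup replaces the chain of if-tests.
CONVERTERS = {DATE: parse_date, UNSIGNED: parse_unsigned, PRICE: parse_price}

def parse_value(line, start, end, fmt):
    # 'end' is inclusive, as in the quoted parseValue().
    return CONVERTERS[fmt](line[start:end + 1])
```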

You should be able to validate, recognize and translate all in
one pass. Get pyparsing to help, if need be.
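
Here is a minimal sketch of the one-pass idea using a compiled
regular expression instead of pyparsing. I have not seen your
data, so the record layout (a 6-digit date, a 5-digit quantity, a
7-digit price in cents and a Y/N flag) is purely an assumption,
as are the names RECORD and parse_line. The single match both
validates the line and hands back every field, so nothing gets
inspected twice.

```python
import re
from time import mktime, strptime

# Hypothetical fixed-width layout; adjust the widths to the real data.
RECORD = re.compile(
    r"(?P<date>\d{6})"    # yymmdd
    r"(?P<qty>\d{5})"     # unsigned integer
    r"(?P<cents>\d{7})"   # price in cents
    r"(?P<flag>[YN])"     # boolean flag
    r"\s*$"
)

def parse_line(line):
    # A failed match means the line is one of the 'wrong' lines.
    m = RECORD.match(line)
    if m is None:
        return None
    d = m.group("date")
    # Same missing-value sentinels as the quoted parseDate().
    if d in ("999999", "000000"):
        when = -1
    else:
        when = int(mktime(strptime(d, "%y%m%d")))
    return {
        "date": when,
        "qty": int(m.group("qty")),
        "price": int(m.group("cents")) / 100.0,
        "flag": m.group("flag"),
    }
```

The pattern is compiled once at import time, so the per-line cost
is one match plus the conversions.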

What does your data look like?

-- 
Neil Cerutti