efficient data loading with Python, is that possible?

John Machin sjmachin at lexicon.net
Wed Dec 12 20:27:43 EST 2007


On Dec 13, 11:44 am, igor.tatari... at gmail.com wrote:
> On Dec 12, 4:03 pm, John Machin <sjmac... at lexicon.net> wrote:
>
> > Inside your function
> > [you are doing all this inside a function, not at global level in a
> > script, aren't you?], do this:
> >     from time import mktime, strptime # do this ONCE
> >     ...
> >     blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))
>
> > It would help if you told us what platform, what version of Python,
> > how much memory, how much swap space, ...
>
> > Cheers,
> > John
>
> I am using a global 'from time import ...'. I will try to do that
> within the
> function and see if it makes a difference.
>
> The computer I am using has 8G of RAM. It's a Linux dual-core AMD or
> something like that. Python 2.4
>
> Here is some of my code. Tell me what's wrong with it :)
>
> def loadFile(inputFile, loader):
>     # .zip files don't work with zlib
>     f = popen('zcat ' + inputFile)
>     for line in f:
>         loader.handleLine(line)
>     ...
>
> In Loader class:
> def handleLine(self, line):
>     # filter out 'wrong' lines
>     if not self._dataFormat(line): return
>
>     # add a new output record
>     rec = self.result.addRecord()
>
>     for col in self._dataFormat.colFormats:
>         value = parseValue(line, col)
>         rec[col.attr] = value
>
> And here is parseValue (will using a hash-based dispatch make it much
> faster?):
>
> def parseValue(line, col):
>     s = line[col.start:col.end+1]
>     # no switch in python
>     if col.format == ColumnFormat.DATE:
>         return Format.parseDate(s)
>     if col.format == ColumnFormat.UNSIGNED:
>         return Format.parseUnsigned(s)
>     if col.format == ColumnFormat.STRING:
>         # and-or trick (no x ? y:z in python 2.4)
>         return not col.strip and s or rstrip(s)
>     if col.format == ColumnFormat.BOOLEAN:
>         return s == col.arg and 'Y' or 'N'
>     if col.format == ColumnFormat.PRICE:
>         return Format.parseUnsigned(s)/100.
>
> And here is Format.parseDate() as an example:
> def parseDate(s):
>     # missing (infinite) value ?
>     if s.startswith('999999') or s.startswith('000000'): return -1
>     return int(mktime(strptime(s, "%y%m%d")))
>
> Hopefully, this should be enough to tell what's wrong with my code.
>

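I have to go out now, so here's a quick overview: too many goddam dots
(attribute lookups) and too many goddam method calls, and you pay for
every one of them once per column per line.
1. Look the format up once per call and compare against the local:
   colfmt = col.format # ONCE
   if colfmt == ...
2. There is no switch, so the if-chain is tested top to bottom: put the
   most frequent format at the top.
3. What is ColumnFormat? What is Format? I think you have gone
   class-crazy, and there's more overhead than working code ...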
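
A minimal sketch of points 1 and 2, written for current Python rather
than 2.4. The Column class, the module-level format constants and the
use of int() in place of Format.parseUnsigned are all assumptions made
for illustration; only the field names (start, end, format, strip, arg)
come from the quoted code. The branch order shown is a guess, so put
whichever format dominates your data first.

```python
from time import mktime, strptime  # imported once, at module level

# Hypothetical plain constants standing in for the ColumnFormat class.
DATE, UNSIGNED, STRING, BOOLEAN, PRICE = range(5)


class Column(object):
    """Hypothetical column description; field names follow the quoted code."""

    def __init__(self, start, end, format, strip=False, arg=None):
        self.start = start
        self.end = end
        self.format = format
        self.strip = strip
        self.arg = arg


def parse_date(s):
    # Same sentinel check as the quoted parseDate.
    if s.startswith('999999') or s.startswith('000000'):
        return -1
    return int(mktime(strptime(s, "%y%m%d")))


def parse_value(line, col):
    s = line[col.start:col.end + 1]
    colfmt = col.format  # one attribute lookup instead of one per branch
    # Assumed frequency order: most common format first.
    if colfmt == STRING:
        if col.strip:
            return s.rstrip()
        return s
    if colfmt == UNSIGNED:
        return int(s)  # assumption: parseUnsigned is a plain int()
    if colfmt == PRICE:
        return int(s) / 100.
    if colfmt == BOOLEAN:
        if s == col.arg:
            return 'Y'
        return 'N'
    if colfmt == DATE:
        return parse_date(s)
    raise ValueError("unknown column format: %r" % (colfmt,))


# Usage: a made-up fixed-width record with four columns.
line = "0042abc Y1950"
cols = [
    Column(0, 3, UNSIGNED),
    Column(4, 7, STRING, strip=True),
    Column(8, 8, BOOLEAN, arg='Y'),
    Column(9, 12, PRICE),
]
values = [parse_value(line, c) for c in cols]
```

Plain functions and constants replace the Format and ColumnFormat
classes here only to cut down on dotted lookups in the inner loop;
whether that matters for your data is something to measure, not assume.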

Cheers,
John


