Using Python to delta-load files into a central DB

Chris Nethery gilcneth at earthlink.net
Fri Apr 13 10:21:15 EDT 2007


Gabriel,

I think that would work well.  Also, thank you for suggesting the use of
filecmp.  I have never used this module, but it looks like a much better
solution than what I had been doing previously--using os.stat and
performing a DB lookup to verify that the filename and timestamp existed
in a 'file update' table.  And if the only limitation of difflib is that
both files must reside in memory, that is no problem: the largest of
these files is just over 200k and, if memory serves me right, I can't
use more than 4MB anyway.  If I also spawn separate processes for
generating the delta files, I should be able to speed things up even
more.
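
Here is the sort of thing I have in mind--a minimal sketch only.  The
directory names and the make_delta.py worker script are hypothetical;
filecmp.cmp with shallow=True compares the os.stat signatures (type,
size, mtime) rather than file contents, which replaces my old
stat-plus-DB-lookup step:

    import filecmp
    import os
    import subprocess
    import sys

    SNAPSHOT_DIR = "yesterday"   # hypothetical copy of yesterday's files
    CURRENT_DIR = "today"        # hypothetical live directory

    def changed_files():
        """Yield names whose stat signature differs from the snapshot."""
        for name in os.listdir(CURRENT_DIR):
            old = os.path.join(SNAPSHOT_DIR, name)
            new = os.path.join(CURRENT_DIR, name)
            if not os.path.exists(old) or not filecmp.cmp(old, new,
                                                          shallow=True):
                yield name

    if __name__ == "__main__":
        # one worker process per changed file, as mentioned above
        # (make_delta.py is a hypothetical delta-generating script)
        workers = [subprocess.Popen([sys.executable, "make_delta.py", name])
                   for name in changed_files()]
        for w in workers:
            w.wait()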

Thanks again for your help!


Best Regards,

Christopher Nethery



"Gabriel Genellina" <gagsl-py2 at yahoo.com.ar> wrote in message 
news:mailman.6444.1176436788.32031.python-list at python.org...
> On Thu, 12 Apr 2007 23:51:22 -0300, Chris Nethery <gilcneth at earthlink.net> 
> wrote:
>
>> Yes, they are tab-delimited text files that will change very little
>> throughout the day.
>> But, this is messy, antiquated 80s junk, nonetheless.
>
> Ugh...
>
>> Rows are designated either by a row type or they contain a comment.
>> Each row type has an identity value, but the 'comment' rows do not.
>> The comment rows, however, are logically associated with the last
>> occurring row type.  When I generate my bulk insert file, I add the
>> identity of the last occurring row type to the comment rows, and
>> generate and populate an additional identity column in order to
>> retain the order of the comments.  Generally rows will either be
>> added or changed, but sometimes rows will be removed.  Typically,
>> only 1-5 new rows will be added to a file in a given day, but users
>> sometimes make manual corrections/deletions to older rows and
>> sometimes certain column values are recalculated.
>
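For what it's worth, the comment-tagging pass I described comes out
roughly like this.  The layout here is hypothetical--I am pretending
column 0 is either a row-type code or the literal 'COMMENT', and that
column 1 of a typed row holds its identity value:

    import csv

    def tag_comments(path):
        """Add (parent identity, sequence) columns to comment rows."""
        parent_id = None
        seq = 0
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if row[0] == "COMMENT":    # hypothetical comment marker
                    seq += 1
                    # comments inherit the identity of the last row type,
                    # plus a sequence number that preserves their order
                    yield row + [parent_id, seq]
                else:
                    parent_id = row[1]     # identity value of this row type
                    seq = 0
                    yield row
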
> http://tgolden.sc.sabren.com/python/win32_how_do_i/watch_directory_for_changes.html
>
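If the win32 change-notification route on that page turns out to be more
than I need, a portable stat-polling pass would also catch the modified
files--this is just an mtime comparison against the snapshot directory,
not Tim Golden's ReadDirectoryChangesW recipe:

    import os

    def newer_than_snapshot(current_dir, snapshot_dir):
        """Return names whose mtime is newer than the saved copy, or new."""
        changed = []
        for name in os.listdir(current_dir):
            new = os.path.join(current_dir, name)
            old = os.path.join(snapshot_dir, name)
            if (not os.path.exists(old)
                    or os.path.getmtime(new) > os.path.getmtime(old)):
                changed.append(name)
        return changed
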
> You could keep a copy of all the files - let's say, as they were
> yesterday.  When you want to process the changes, iterate over all the
> files and see if they are newer than your copy; you could use the
> filecmp module:
> http://docs.python.org/lib/module-filecmp.html
> For each modified file: load it, and process the comments, adding the
> associated row type and the identity.  Do the same with the
> "yesterday" file.  (I assume they're small enough that you can keep
> both versions in memory.)  You then have two lists of lines; use the
> functions in the difflib module to detect the changed lines, and based
> on those results, generate your database inserts/deletes/updates.
>
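That difflib step is the piece I was missing.  Here is a minimal sketch
of how I might map the opcodes onto database operations, assuming both
versions fit in memory (at just over 200k each, they do); the actual SQL
generation is omitted:

    import difflib

    def classify_changes(old_lines, new_lines):
        """Map difflib opcodes onto insert/delete/update candidates."""
        inserts, deletes, updates = [], [], []
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "insert":
                inserts.extend(new_lines[j1:j2])
            elif op == "delete":
                deletes.extend(old_lines[i1:i2])
            elif op == "replace":
                # pair old/new lines as updates; uneven leftovers fall
                # back to plain deletes or inserts
                old_chunk = old_lines[i1:i2]
                new_chunk = new_lines[j1:j2]
                n = min(len(old_chunk), len(new_chunk))
                updates.extend(zip(old_chunk[:n], new_chunk[:n]))
                deletes.extend(old_chunk[n:])
                inserts.extend(new_chunk[n:])
        return inserts, deletes, updates

I would feed it the readlines() lists from the "yesterday" and "today"
versions of each changed file.
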
> This way you will not process the unchanged files, and within each
> file you will ignore the unchanged lines.  At least in principle, it
> should be faster than redoing everything from scratch each time...
>
>> Did I mention that the header contains another implied hierarchy?
>> Fortunately, I can just ignore it and strip it off.
>
> good - I imagine it's enough work as it is now...
>
> -- 
> Gabriel Genellina
> 




