python file synchronization

Cameron Simpson cs at zip.com.au
Tue Feb 7 22:40:25 EST 2012


On 07Feb2012 01:33, silentnights <silentquote at gmail.com> wrote:
| I have the following problem: I have an appliance (A) which generates
| records and writes them into a file (X); the appliance is accessible
| through FTP from a server (B). I have another central server (C) that
| runs a Django app and needs to continuously get the records from
| file (X).
| 
| The problems are as follows:
| 1. (A) is heavily writing to the file, so copying the file will result
| in an incomplete line at the end.
| 2. I have many (A)s and (B)s that I need to get the data from.
| 3. I can't afford losing any records from file (X)
[...]
| The above is implemented and working; the problem is that it requires
| so many syncs, has high overhead, and is hard to debug.

Yep.

I would change the file discipline. Accept that FTP is slow and has no
locking. Accept that reading records from an actively growing file is
often tricky and sometimes impossible depending on the record format.
So don't. Hand off completed files regularly and keep the incomplete
file small.

Have (A) write records to a file whose name clearly shows the file to be
incomplete, e.g. "data.new". Every so often (even once a second), _if_ the
file is not empty: close it, _rename_ it to "data.timestamp" or
"data.sequence-number", and open a new "data.new" for new records.
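A minimal sketch of that rotation on the (A) side. The "data.*" naming is
just illustrative; a sequence number instead of a timestamp would avoid
collisions if you rotate more than once a second:

```python
import os
import time

def rotate(dirpath, new_name="data.new"):
    """Hand off the current incomplete file by renaming it to a
    timestamped name, so readers only ever see completed files.
    Returns the completed file's path, or None if there was nothing
    to hand off."""
    newpath = os.path.join(dirpath, new_name)
    # Skip the rotation if there is no incomplete file, or it is empty.
    if not os.path.exists(newpath) or os.path.getsize(newpath) == 0:
        return None
    done = os.path.join(dirpath, "data.%d" % int(time.time()))
    # rename() within one filesystem is atomic on POSIX, so a reader
    # never sees a half-moved file.
    os.rename(newpath, done)
    return done
```

The writer then just reopens "data.new" for the next batch of records.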

Have the FTP client fetch only the completed files.
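On the fetching side, the client only needs to skip anything still named
"data.new". A rough sketch with ftplib (host, credentials and the delete-
after-fetch policy are assumptions here; error handling omitted):

```python
import os
from ftplib import FTP

INCOMPLETE = "data.new"

def completed_names(names):
    """Filter a directory listing down to completed data files.
    Anything still named "data.new" is being written and must be skipped."""
    return sorted(n for n in names if n.startswith("data.") and n != INCOMPLETE)

def fetch_completed(host, user, password, localdir):
    """Fetch each completed file from the appliance, then delete it
    remotely so it is not fetched again."""
    ftp = FTP(host)
    ftp.login(user, password)
    for name in completed_names(ftp.nlst()):
        with open(os.path.join(localdir, name), "wb") as f:
            ftp.retrbinary("RETR " + name, f.write)
        ftp.delete(name)
    ftp.quit()
```

Because completed files are never modified after the rename, there is no
locking to worry about on the FTP side.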

You can apply a similar approach in the socket daemon: look only for
completed data files. Reading the filenames from a directory is very
fast if you don't stat() them (i.e. just os.listdir). Just open and scan
any new files that appear.
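Something like this for the daemon's polling loop; the caller keeps a
`seen` set between polls, and os.listdir() returns names only, so there is
no stat() per file:

```python
import os

def new_files(dirpath, seen, incomplete="data.new"):
    """Return completed data files not yet processed, in name order.
    `seen` is a set of already-processed names the caller keeps
    between polls; it is updated in place."""
    fresh = [n for n in sorted(os.listdir(dirpath))
             if n.startswith("data.") and n != incomplete and n not in seen]
    seen.update(fresh)
    return fresh
```

Each poll, the daemon opens and scans whatever new_files() returns, then
sleeps briefly and polls again.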

That would be my first cut.
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Performing random acts of moral ambiguity.
        - Jeff Miller <jxmill2 at gonix.com>


