python file synchronization

Wed Feb 8 01:57:43 EST 2012

Hi Cameron,

Thanks a lot for your help. I forgot to mention that the FTP server is
not under my control: I can't control how the file grows or how the records
are added. I can only log in to it and copy the whole file.

The reason I parse the file, compute the diff between the new file and the
old one, and copy that diff to new_file.time_stamp is to cut down the file
size, so that when server (C) grabs the file it grabs only new data. This
also cuts down the network bandwidth.
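A minimal sketch of that diff step, assuming the file is append-only and
each record ends with a newline (neither is guaranteed by anything above).
The function name and the saved-offset approach are illustrative, not the
code actually in use. It returns only complete lines added since the last
run, so a half-written trailing record is left for the next pass:

```python
def extract_new_records(path, offset):
    """Return (new_bytes, new_offset) for complete lines added after offset.

    Assumes the file only grows by appending newline-terminated records.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    # Keep everything up to and including the last newline; rfind returns
    # -1 when no complete line has arrived yet, which makes end == 0.
    end = chunk.rfind(b"\n") + 1
    return chunk[:end], offset + end
```

The caller would persist new_offset between runs and write new_bytes to
the timestamped diff file only when it is non-empty.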

One of my problems arose after mounting server (B)'s diffs_dir on server (A)
through NFS. I used to create filename.lock locally on server (B) first,
then copy filename to server (B), then remove filename.lock, so that when
the daemon running on server (C) parses the files in the local_diffs dir,
it ignores the files that are still being copied.

After searching more yesterday, I found that a local mv (within the same
filesystem) is atomic. So instead of creating the lock files, I will copy
the new diffs to a tmp dir and, once the copy is finished, mv them into the
actual diffs dir. That avoids reading a file while it is still being copied.
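A rough sketch of the copy-then-mv idea in Python. The function and
directory names are hypothetical, and it assumes tmp_dir and final_dir
live on the same filesystem, since a rename is only atomic in that case:

```python
import os
import shutil


def publish_atomically(src, tmp_dir, final_dir):
    """Copy src into tmp_dir, then rename it into final_dir.

    Assumes tmp_dir and final_dir are on the same filesystem, so the
    rename is atomic and readers never see a partially copied file.
    """
    name = os.path.basename(src)
    tmp_path = os.path.join(tmp_dir, name)
    final_path = os.path.join(final_dir, name)
    shutil.copyfile(src, tmp_path)    # slow part, invisible to readers
    os.rename(tmp_path, final_path)   # appears in final_dir in one step
    return final_path
```

If the two directories turn out to be on different filesystems, os.rename
raises OSError instead of silently falling back to a copy, which makes the
wrong setup easy to notice.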

Sorry if the above is a bit confusing; the system is a bit complex.

There is one more thing that confuses me: I am very bad at testing, and I
am trying to start writing unit tests for my code. What I find hard is how
to test code like the part that does the copy, mv and so on, and also the
code that fetches data from the web.
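One possible way to approach this, sketched with the standard unittest and
tempfile modules: run file operations inside a throwaway directory, and
pass the network fetch in as a callable so a test can substitute canned
data. save_records and its fetch parameter are made up for illustration
and are not part of the existing code:

```python
import os
import shutil
import tempfile
import unittest


def save_records(fetch, dest_path):
    """Write whatever fetch() returns to dest_path; return the byte count.

    fetch is any zero-argument callable returning bytes. Real code would
    pass something that talks to FTP or the web; tests pass a stub.
    """
    data = fetch()
    with open(dest_path, "wb") as f:
        f.write(data)
    return len(data)


class SaveRecordsTest(unittest.TestCase):
    def setUp(self):
        # Each test gets its own scratch directory, removed afterwards.
        self.workdir = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, self.workdir)

    def test_writes_fetched_data(self):
        dest = os.path.join(self.workdir, "records.txt")
        written = save_records(lambda: b"r1\nr2\n", dest)
        self.assertEqual(written, 6)
        with open(dest, "rb") as f:
            self.assertEqual(f.read(), b"r1\nr2\n")
```

Saved in a module, the test case can be run with "python -m unittest".
The same pattern should work for the copy and mv code: point it at
directories created under the scratch directory and assert on what ends
up where.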

On Wed, Feb 8, 2012 at 5:40 AM, Cameron Simpson <cs at zip.com.au> wrote:

> On 07Feb2012 01:33, silentnights <silentquote at gmail.com> wrote:
> | I have the following problem: I have an appliance (A) which generates
> | records and writes them into file (X). The appliance is accessible
> | through FTP from a server (B). I have another central server (C) that
> | runs a Django app, which needs to continuously get the records from
> | file (X).
> |
> | The problems are as follows:
> | 1. (A) is heavily writing to the file, so copying the file will result
> | in an incomplete line at the end.
> | 2. I have many (A)s and (B)s  that I need to get the data from.
> | 3. I can't afford losing any records from file (X)
> [...]
> | The above is implemented and working; the problem is that it requires
> | so many syncs, has a high overhead and is hard to debug.
>
> Yep.
>
> I would change the file discipline. Accept that FTP is slow and has no
> locking. Accept that reading records from an actively growing file is
> often tricky and sometimes impossible depending on the record format.
> So don't. Hand off completed files regularly and keep the incomplete
> file small.
>
> Have (A) write records to a file whose name clearly shows the file to be
> incomplete. Eg "data.new". Every so often (even once a second), _if_ the
> file is not empty: close it, _rename_ to "data.timestamp" or
> "data.sequence-number", open a new "data.new" for new records.
>
> Have the FTP client fetch only the completed files.
>
> You can perform a similar effort for the socket daemon: look only for
> completed data files. Reading the filenames from a directory is very
> fast if you don't stat() them (i.e. just os.listdir). Just open and scan
> any new files that appear.
>
> That would be my first cut.
> --
> Cameron Simpson <cs at zip.com.au> DoD#743
> http://www.cskk.ezoshosting.com/cs/
>
> Performing random acts of moral ambiguity.
>        - Jeff Miller <jxmill2 at gonix.com>
>
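For the daemon side of Cameron's suggestion (look only for completed
files, using os.listdir without stat()), a minimal sketch might look like
the following. The function name, the in-memory seen set, and the
"data.new" default are assumptions for illustration:

```python
import os


def find_completed_files(directory, seen, incomplete_name="data.new"):
    """Return sorted names of completed files not seen before.

    Skips the file still being written (incomplete_name) and records the
    returned names in the seen set so each file is handed out only once.
    """
    fresh = sorted(
        name for name in os.listdir(directory)
        if name != incomplete_name and name not in seen
    )
    seen.update(fresh)
    return fresh
```

A daemon would call this in a loop with one long-lived seen set, opening
and scanning each returned file. The sorted order only matches arrival
order if the timestamp or sequence suffix sorts lexically.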