python file synchronization

Wed Feb 8 04:59:36 EST 2012

[ Please reply inline; it makes the discussion read like a conversation,
  with context. - Cameron
]

On 08Feb2012 08:57, Sherif Shehab Aldin <silentquote at gmail.com> wrote:
| Thanks a lot for your help, I just forgot to state that the FTP server is
| not under my command, I can't control how the file grow, or how the records
| are added, I can only login to It, copy the whole file.

Oh. That's a pity.

| The reason why I am parsing the file and trying to get the diffs between
| the new file and the old one, and copy it to new_file.time_stamp is that I
| need to cut down the file size so when server (C) grabs the file, It grabs
| only new data, also to cut down the network bandwidth.

Can a simple byte count help here? Copy the whole file with FTP. From
the new copy, extract the bytes from the last byte count offset onward.
Then parse the smaller file, extracting whole records for use by (C).
That way you can just keep the unparsed tail (partial record I imagine)
around for the next fetch.
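
A rough sketch of that idea follows. I don't know your record format, so
this assumes newline-terminated records, and the function name and sample
data are made up for illustration:

```python
def split_complete_records(pending, new_bytes, terminator=b"\n"):
    """Return (records, leftover).

    pending:   the unparsed tail kept from the previous fetch
    new_bytes: the bytes past the last byte count offset in the new copy
    Assumes each record ends with `terminator`; adjust for the real format.
    """
    data = pending + new_bytes
    # Everything after the last terminator is a partial record.
    end = data.rfind(terminator)
    if end < 0:
        # No complete record yet: keep it all for next time.
        return [], data
    complete = data[:end]
    leftover = data[end + len(terminator):]
    return complete.split(terminator), leftover


# Example: the old copy was 10 bytes long, the new copy is longer.
old_size = 10
new_copy = b"rec1\nrec2\nrec3\nrec4\npart"
records, leftover = split_complete_records(b"", new_copy[old_size:])
```

Pass `leftover` back in as `pending` on the next fetch, and remember
len(new_copy) as the next byte count offset.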

Looking at RFC959 (the FTP protocol):

  http://www.w3.org/Protocols/rfc959/4_FileTransfer.html

it looks like you can do a partial file fetch, also, by issuing a REST
(restart) command to set a file offset and then issuing a RETR (retrieve)
command to get the rest of the file. These all need to be in binary mode
of course.

So in principle you could track the byte offset of what you have fetched
with FTP so far, and fetch only what is new.
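
With Python's ftplib that might look like the sketch below. retrbinary()
accepts a rest= argument, which makes it send REST before the RETR, and it
switches to binary mode itself. The function name is invented, the host,
login and filename in the usage comment are placeholders, and this is
untested against your server, which may or may not honour REST:

```python
from ftplib import FTP


def fetch_from_offset(ftp, remote_name, offset):
    """Fetch the bytes of remote_name from byte `offset` onward.

    `ftp` is a logged-in ftplib.FTP connection. The server must support
    the REST command, which not all servers do.
    """
    chunks = []
    # retrbinary sets binary (TYPE I) mode, then its rest= argument
    # sends "REST <offset>" ahead of the RETR command.
    ftp.retrbinary("RETR " + remote_name, chunks.append, rest=offset)
    return b"".join(chunks)


# Hypothetical usage; every name here is a placeholder:
# ftp = FTP("ftp.example.com")
# ftp.login("user", "password")
# new_data = fetch_from_offset(ftp, "data.log", bytes_so_far)
# bytes_so_far += len(new_data)
```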

| One of my problems was after mounting server (B) diffs_dir into Server (A)
| through NFS, I used to create filename.lock first into server (B) local file
| then start copy filename to server (B) then remove filename.lock, so when
| the daemon running on server (C) parses the files in the local_diffs dir,
| ignores the files that are still being copied,
| 
| After searching more yesterday, I found that local mv is atomic, so instead
| of creating the lock files, I will copy the new diffs to tmp dir, and after
| the copy is over, mv it to actual diffs dir, that will avoid reading It
| while It's still being copied.

Yes, this sounds good. Provided the mv is on the same filesystem.

For example: "mv /tmp/foo /home/username/foo" is actually a copy and not
a rename because /tmp is normally a different filesystem from /home.
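
In Python the same caveat shows up as the difference between os.rename()
and shutil.move(). A minimal sketch, with invented function and directory
names, assuming a POSIX system where rename replaces the target in one
step:

```python
import os
import shutil


def publish(src_path, tmp_dir, dest_dir):
    """Copy src_path into tmp_dir, then rename it into dest_dir.

    tmp_dir and dest_dir must be on the same filesystem. os.rename()
    raises OSError if they are not, whereas shutil.move() (like mv)
    would quietly fall back to a non-atomic copy.
    """
    name = os.path.basename(src_path)
    tmp_path = os.path.join(tmp_dir, name)
    final_path = os.path.join(dest_dir, name)
    shutil.copyfile(src_path, tmp_path)  # slow part, out of the reader's sight
    os.rename(tmp_path, final_path)      # the single step the reader can see
    return final_path
```

Getting an exception from os.rename() is useful here: it tells you the two
directories are not on the same filesystem instead of hiding the copy.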

| Sorry if the above is bit confusing, the system is bit complex.

Complex systems often need fiddly solutions.

| Also there is one more factor that confuses me, I am so bad in testing, and
| I am trying to start actually implement unit testing to test my code, what
| I find hard is how to test code like the one that do the copy, mv and so,
| also the code that fetch data from the web.

Ha. I used to be very bad at testing, now I am improving and am merely
weak.

One approach to testing is to make a mock up of the other half of the
system, and test against the mockup.

For example, you have code to FTP new data and then feed it to (C). You
don't control the server side of the FTP. So you might make a small mock
up program that writes valid (but fictitious) data records progressively
to a local data file (write record, flush, pause briefly, etc). If you
can FTP to your own test machine you could then treat _that_ growing
file as the remote server's data file.
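
Such a mock up can be very small. In this sketch the record layout
(sequence number, timestamp, newline) and the file name are pure
invention; substitute something shaped like your real records:

```python
import time


def write_mock_records(path, count, pause=0.5):
    """Append `count` fictitious records to `path`, one at a time."""
    with open(path, "ab") as f:
        for n in range(count):
            # Invented record layout: sequence number, timestamp, newline.
            record = "%d,%f\n" % (n, time.time())
            f.write(record.encode("ascii"))
            f.flush()          # push the record out so readers can see it
            time.sleep(pause)  # let the file grow slowly, like the real one


# Hypothetical usage: write_mock_records("mock_data.log", 100)
```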

Then you could copy it progressively, using a byte count to keep track of
the bits you have already seen so that you can skip them.

If you can't FTP to your test system, you could abstract out the "fetch
part of this file by FTP" into its own function. Write an equivalent
function that fetches part of a local file just by opening it.

Then you could use the local file version in a test that doesn't
actually do the FTP, but could exercise the rest of it.
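
Something along these lines, where every name is invented for
illustration. The point is only that the code doing the work takes the
fetch function as a parameter, so it does not care where the bytes come
from:

```python
def fetch_local_from_offset(path, offset):
    """Stand-in for the FTP fetch: read a local file from `offset` onward."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read()


def sync_once(fetch, source, offset, out_path):
    """Fetch whatever is new, append it to out_path, return the new offset.

    `fetch` is any callable taking (source, offset) and returning bytes,
    so a real FTP fetch and the local stand-in are interchangeable.
    """
    data = fetch(source, offset)
    if data:
        with open(out_path, "ab") as out:
            out.write(data)
    return offset + len(data)
```

A test would call sync_once(fetch_local_from_offset, ...) against a local
file; the real program would pass in the FTP version instead.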

It is also useful to make simple tests of small pieces of the code.
So make the code to get part of the data a simple function, and write
tests to execute it in a few ways (no new data, part of a record,
several records etc).
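
With the standard unittest module, tests of that kind might look like
this. The function under test is a made-up example that again assumes
newline-terminated records; yours will differ:

```python
import unittest


def whole_records(data):
    """Return (records, bytes_consumed) for the complete records in data."""
    end = data.rfind(b"\n") + 1  # 0 if there is no complete record
    # The slice ends with a newline (or is empty), so split() always
    # yields one trailing empty item, which [:-1] drops.
    return data[:end].split(b"\n")[:-1], end


class WholeRecordsTests(unittest.TestCase):

    def test_no_new_data(self):
        self.assertEqual(whole_records(b""), ([], 0))

    def test_part_of_a_record(self):
        self.assertEqual(whole_records(b"half a rec"), ([], 0))

    def test_several_records(self):
        self.assertEqual(whole_records(b"a\nb\nc\n"),
                         ([b"a", b"b", b"c"], 6))

    def test_records_then_partial(self):
        self.assertEqual(whole_records(b"a\nb\npar"), ([b"a", b"b"], 4))
```

Saved as a file, this can be run with "python -m unittest" followed by the
file name.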

There are many people better than I to teach testing.

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Testing can show the presence of bugs, but not their absence.   - Dijkstra