shutil.copyfile is incomplete (truncated)

Roy Smith roy at panix.com
Fri Apr 12 10:47:31 EDT 2013


In article <mailman.506.1365751267.3114.python-list at python.org>,
 Rob Schneider <rmschne at gmail.com> wrote:

> Source (correct one) is 47,970 bytes. Target after copy of 45,056 bytes.  
> I've tried changing what gets written to change the file size. It is usually 
> this sort of difference.
> 
> The file system is Mac OS Extended Journaled (default as out of the box).  

Is it always the tail end of the file that gets truncated, or is it 
missing (or mutating) data in the middle of the file?  I'm just grasping 
at straws here, but maybe it's somehow messing up line endings (turning 
CRLF pairs into just LF), or using some other kind of encoding for 
Unicode characters?
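
One quick way to test the line-ending guess is to count the CRLF pairs in 
the source and see whether that count equals the number of missing bytes.  
This is only a rough sketch; the function name and the path arguments are 
made up for illustration, and it assumes the source fits in memory:

```python
import os


def crlf_hypothesis(source_path, target_path):
    """Return (size_gap, crlf_count, matches) for a source/target pair.

    If every CRLF in the source had been rewritten as a bare LF, the
    target would be shorter by exactly one byte per CRLF pair.
    """
    with open(source_path, "rb") as f:  # binary mode: no newline translation
        data = f.read()
    size_gap = len(data) - os.path.getsize(target_path)
    crlf_count = data.count(b"\r\n")
    return size_gap, crlf_count, size_gap == crlf_count
```

If the two numbers don't match, line-ending conversion alone can't explain 
the size difference.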

If you compare the files with cmp, does it say:

$ cmp original truncated 
cmp: EOF on truncated

That's what I would expect if it's a strict truncation.  If it says 
anything else, you've got a data-munging problem.
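
If cmp isn't handy, the same check can be sketched in Python.  The 
function name and the result labels below are my own invention, not 
anything from cmp itself; it just reports whether the files are identical, 
where they first differ, or which one hit EOF first:

```python
def first_difference(path_a, path_b, chunk_size=65536):
    """Compare two files byte by byte.

    Returns (kind, offset), where kind is one of:
      "identical" - same contents; offset is the common length
      "differ"    - first differing byte is at offset
      "eof_a"     - path_a is a strict prefix of path_b (path_a truncated)
      "eof_b"     - path_b is a strict prefix of path_a (path_b truncated)
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a == b:
                if not a:  # both files ended at the same point
                    return ("identical", offset)
                offset += len(a)
                continue
            common = min(len(a), len(b))
            for i in range(common):
                if a[i] != b[i]:
                    return ("differ", offset + i)
            # The shared part matched, so the shorter chunk hit end of file.
            kind = "eof_a" if len(a) < len(b) else "eof_b"
            return (kind, offset + common)
```

Called as first_difference("original", "truncated"), a result of "eof_b" 
corresponds to the strict-truncation case; "differ" means the data was 
changed somewhere before the end.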

What I would normally do around this time is run a system call trace on 
the process to watch all the descriptor-related (e.g. open, create, 
write) system calls.  On OS X, that means dtruss.  Unfortunately, I'm 
not that familiar with the OS X variant so I can't give you specific 
advice about which options to use.

When you can see the system calls, you know exactly what your process is 
doing.  You should be able to see the output file being opened and a 
descriptor returned, then find all the write() calls to that descriptor.  
You'll also be able to find any other system calls on that pathname 
after the descriptor is closed.

Please report back what you find!

Oh, another trick you might want to try is making the output file path 
/dev/stdout and redirecting the output into a file with the shell.  See 
if that makes any difference.  Or, try something like (assuming the -o 
option to your script sets the output filename):

python my_prog.py -o /dev/stdout | dd bs=1 of=xxx

That will do a couple of things.  First, dd will report how many bytes 
it read and wrote, so you can see if that's the correct number.  Also, 
since your process will no longer be writing to a real file, if anything 
is doing something weird like a seek() after you're done writing, that 
will fail since you can't seek() on a pipe.
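
To illustrate the seek() point, here is a small sketch (the function name 
is made up) that creates a pipe and tries to seek on its write end.  On 
POSIX systems such as OS X and Linux I'd expect the attempt to fail with 
ESPIPE ("Illegal seek"):

```python
import errno
import os


def seek_error_on_pipe():
    """Try to seek on the write end of a pipe; return the errno, or None."""
    read_fd, write_fd = os.pipe()
    try:
        os.lseek(write_fd, 0, os.SEEK_SET)
    except OSError as exc:
        return exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return None  # the seek unexpectedly succeeded


if __name__ == "__main__":
    err = seek_error_on_pipe()
    if err is None:
        print("seek succeeded")
    else:
        print("seek failed:", errno.errorcode.get(err, err))
```

So if something in the program does seek on the output after writing, 
sending the output through a pipe should turn that into a visible error 
instead of a silently altered file.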


