[Tutor] File transfer

Steven D'Aprano steve at pearwood.info
Sun Oct 31 23:42:45 CET 2010


Chris King quoted Corey Richardson:

>  On 10/31/2010 12:03 PM, Corey Richardson wrote:
[...]
>> To read from a file, you open it, and then read() it into a string 
>> like this:
>> for line in file:
>>     string += string + file.readline()

Aiieeee! Worst way to read from a file *EVAR*!!!

Seriously. Don't do this. This is just *wrong*.

(1) You're mixing file iteration (which already reads line by line) with 
readline(), which would cause every second line to go missing. 
Fortunately Python doesn't let you do this:

>>> for line in file:
...     print file.readline()
...
Traceback (most recent call last):
   File "<stdin>", line 2, in <module>
ValueError: Mixing iteration and read methods would lose data


(2) Even if Python let you do it, it would be slow. Never write any loop 
with repeated string concatenation:

result = ''
for string in list_of_strings:
    result += string  # No! Never do this! BAD BAD BAD!!!

This can, depending on the version of Python, the operating system, and 
various other factors, end up being thousands of times slower than the 
alternative:

result = ''.join(list_of_strings)

I'm not exaggerating. There was a bug reported in the Python HTTP 
library about six(?) months ago where Python was taking half an hour to 
read a file that Internet Explorer or wget could read in under a second.

You might be lucky and never notice the poor performance, but one of 
your users will. This is a Shlemiel the Painter algorithm:

http://www.joelonsoftware.com/articles/fog0000000319.html

Under some circumstances, *some* versions of Python can correct for the 
poor performance and optimize it to run quickly, but not all versions, 
and even the ones that do sometimes run into operating system dependent 
problems that lead to terrible performance. Don't write Shlemiel the 
Painter code.
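
As a minimal sketch of how the quoted loop could be rewritten (the 
helper name read_lines_joined and the sample file are made up for 
illustration): iterate over the file once, collect the pieces in a 
list, and join them in a single step at the end.

```python
import os
import tempfile


def read_lines_joined(path):
    """Return the file's text by collecting lines and joining them once."""
    pieces = []
    with open(path, "r") as fp:
        for line in fp:          # iteration already reads line by line
            pieces.append(line)  # appending to a list is cheap
    return "".join(pieces)       # one join instead of repeated +=


if __name__ == "__main__":
    # Hypothetical sample file, created only for this demonstration.
    fd, name = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as fp:
        fp.write("first\nsecond\nthird\n")
    print(repr(read_lines_joined(name)))
    os.remove(name)
```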

The right way to read chunks of data from a file is with the read method:

fp = open("filename", "rb")  # open in binary mode
data = fp.read()  # read the whole file
fp.close()

If the file is large, and you want to read it in small chunks, read() 
takes an optional argument giving the maximum number of bytes to read:

fp.read(64)  # read up to 64 bytes
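
A rough sketch of how that call might be used in a loop: keep calling 
read() with a size until it returns an empty result, which signals end 
of file. The generator name read_in_chunks is hypothetical, and 
io.BytesIO stands in for a real file opened in binary mode.

```python
import io


def read_in_chunks(fp, chunk_size=64):
    """Yield successive chunks from a file object opened in binary mode."""
    while True:
        chunk = fp.read(chunk_size)  # returns at most chunk_size bytes
        if not chunk:                # an empty result means end of file
            break
        yield chunk


if __name__ == "__main__":
    # io.BytesIO stands in for a real file opened with open(name, "rb").
    fake_file = io.BytesIO(b"x" * 150)
    sizes = [len(chunk) for chunk in read_in_chunks(fake_file)]
    print(sizes)  # [64, 64, 22]
```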

If you want to read text files in lines, you can use the readline() 
method, which reads up to and including the next end of line; or 
readlines(), which returns a list of all the lines; or just iterate over 
the file to get one line at a time.
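
The three line-oriented approaches can be compared side by side. This 
is only a sketch: io.StringIO stands in for a text file opened with 
open(name, "r"), and the sample text is invented.

```python
import io

# Invented sample text; io.StringIO behaves like an open text file.
text = u"alpha\nbeta\ngamma\n"

fp = io.StringIO(text)
first = fp.readline()   # one line, including its trailing newline
rest = fp.readlines()   # list of all the remaining lines

fp = io.StringIO(text)
iterated = [line for line in fp]  # iteration yields one line at a time

print(repr(first))
print(rest)
print(iterated)
```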

Chris King went on to ask:

> I don't think readline will work an image. How do you get raw binary 
> from a zip? Also make sure you do reply to the tutor list too, not just me.

readline() works fine on binary files, including images, but it won't be 
useful because binary files aren't split into lines.

readline() reads until end-of-line, which varies according to the 
operating system you are running, but often is \n. A binary file may or 
may not contain any end-of-line characters. If it does, then readline() 
will read up to the next EOL perfectly fine:

f.readline()
=> '\x23\x01\0=#%\xff\n'

and if it doesn't, readline() will happily read the entire file all the 
way to the end:

'\x23\x01\0=#%\xff3m.\x02\0\xa0\0\0\0+)\0\x03c!<ft\0\xc2|\x8e~\0...'
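
Both cases can be reproduced without a real image file. In this 
sketch io.BytesIO stands in for a file opened in binary mode, and the 
byte values are made up:

```python
import io

# Made-up binary data; io.BytesIO stands in for open(name, "rb").
with_newline = io.BytesIO(b"\x23\x01\x00=#%\xff\nmore bytes")
without_newline = io.BytesIO(b"\x23\x01\x00=#%\xff3m.\x02\x00\xa0")

first = with_newline.readline()          # stops just after the first b"\n"
everything = without_newline.readline()  # no b"\n", so it reads to the end

print(repr(first))
print(repr(everything))
```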


To read a zip file as raw data, just open it as a regular binary file:

f = open("data.zip", "rb")
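
To see that this really does give you the raw bytes rather than the 
unpacked contents, here is a small sketch that builds a throwaway 
archive and reads it back as an ordinary binary file. The member name 
hello.txt and the temporary directory are hypothetical.

```python
import os
import shutil
import tempfile
import zipfile

# Build a small throwaway archive; the names here are hypothetical.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "data.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("hello.txt", "hello world\n")

# Open it as an ordinary binary file and read the raw bytes.
with open(zip_path, "rb") as f:
    raw = f.read()

print(len(raw))
print(repr(raw[:4]))  # a zip archive with members starts with PK\x03\x04

shutil.rmtree(tmpdir)  # clean up the temporary directory
```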

But this is the wrong way to solve the problem of transferring files 
from one computer to another. The right way is to use a transfer 
protocol that already works, something like FTP or HTTP. The only reason 
for dealing with files as bytes is if you want to create your own file 
transfer protocol.
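
As one possible illustration of leaning on an existing protocol, here 
is a sketch that fetches a file over HTTP with the standard library. 
It assumes Python 3's urllib.request (in Python 2 the equivalent call 
was urllib2.urlopen); the function name download and the URL in the 
comment are hypothetical.

```python
import shutil
from urllib.request import urlopen  # Python 3; Python 2 had urllib2.urlopen


def download(url, dest_path, chunk_size=64 * 1024):
    """Copy the resource at url to dest_path, one chunk at a time."""
    with urlopen(url) as response, open(dest_path, "wb") as out:
        # copyfileobj reads and writes in chunks, so the whole file
        # never has to sit in memory at once.
        shutil.copyfileobj(response, out, chunk_size)


# Example call (hypothetical URL and filename):
#     download("http://example.com/data.zip", "data.zip")
```

The sending side then only needs an ordinary web server, and nobody 
has to invent their own framing, error handling or retry logic.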



-- 
Steven


