Fixing socket.makefile()

Bryan Olson fakeaddress at nowhere.org
Mon Aug 9 23:58:13 EDT 2004


Here's the problem: Suppose we use:

     import socket
     [...]
     f = some_socket.makefile()

Then:

     f.read() is efficient, but verbose, and incorrect (or at
     least does not play well with others);

     f.readline() is correct, but verbose and inefficient.

To justify the "verbose" part, just look at the code in the
Python library's socket.py.  Below, I'll explain playing well
with others, and then (in)efficiency.

Consider the operations:

     f = some_socket.makefile()
     ch = f.read(1)
     print "The first char is", ch
     ch = some_socket.recv(1)
     print "The second char is", ch

The code above does *not* (usually) print the first and second
characters from the socket.

The problem is that makefile() returns a Python object that has
its own local buffer.  The recv() call reads directly from the
socket, oblivious to any data queued in the file object's
buffer.  The problem is not limited to recv(); select(), and
perhaps other calls, will ignore the buffer and look directly at
the socket.  Output buffering appears to have a similar problem.
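
The hidden-buffer effect can be reproduced without a network. The following is a minimal sketch, assuming a current Python 3 interpreter (so bytes literals rather than the Python 2 syntax used elsewhere in this post) and a platform where socket.socketpair() is available; the function name demo_hidden_buffer is made up for illustration.

```python
import select
import socket


def demo_hidden_buffer():
    """Show that a buffered file object hides data from select()."""
    a, b = socket.socketpair()
    with a, b:
        a.sendall(b"AB")
        with b.makefile("rb") as f:      # buffered file object over b
            first = f.read(1)            # pulls both bytes into f's buffer
            # Ask the OS whether b still has unread data. The second byte
            # already sits in f's buffer, so the socket itself looks empty.
            readable, _, _ = select.select([b], [], [], 0)
            second = f.read(1)           # served from f's buffer
    return first, bool(readable), second
```

On a typical system this should return (b"A", False, b"B"): the second byte is only reachable through the file object, and a recv() on the socket at that point would block instead of returning it.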

Now look up socket.makefile().readline().  It gets one byte at a
time. It will get the byte from the Python buffer if the buffer
is non-empty, otherwise it will try to recv() one byte at a
time, directly from the socket.  By itself, readline() never
over-reads the socket; if select() and recv() would work
correctly before the readline(), they'll work after.  While
correct, reading one byte at a time is painfully slow.

The Python Library Reference is silent on whether the
socket.makefile() operations are supposed to interact correctly
with the direct socket operations.  If they are supposed to play
well together, then read() is wrong.  If they are not, then
readline() is absurdly slow.

Enough of my whining.  The good news is that we can have both
efficiency and correctness, and we can fix the bloat at the same
time.  Operating systems already do efficient buffering for
sockets.  That efficiency varies, but any smart operating system
copies buffers to user-space in large chunks, and answers
recv()'s from the buffers without system calls, when possible.
Python's socket module now supports MSG_PEEK, which enables
Python code to examine a socket's native buffer.
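
To make the MSG_PEEK idea concrete, here is a small sketch, again assuming Python 3 and socket.socketpair(); demo_peek is a hypothetical name. It peeks at the queued data, consumes only through the first newline, and leaves the rest in the operating system's buffer.

```python
import socket


def demo_peek():
    """Peek at queued data, then consume only the first line."""
    a, b = socket.socketpair()
    with a, b:
        a.sendall(b"hello\nworld")
        peeked = b.recv(20, socket.MSG_PEEK)   # look, but leave data queued
        line = b.recv(peeked.find(b"\n") + 1)  # consume through the newline
        rest = b.recv(20)                      # remainder was never consumed
    return peeked, line, rest
```

Because nothing past the newline is ever taken off the socket, a later recv() or select() on the same socket still sees the remaining bytes.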

Below my sig, I show code to replace the corresponding member
functions in the class socket._fileobject.  The updated version
passes the tests in test_socket.py.

Make sense?  Worth doing?  I thought I'd talk it up here before
jumping into the devel list.


-- 
--Bryan



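# Replacement methods for socket._fileobject.  If they are tried
# outside socket.py, they also need:
#     import sys
#     from socket import MSG_PEEK
#
# class _fileobject(object):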

     def __init__(self, sock, mode='rb', bufsize=-1):
         self._sock = sock
         if bufsize <= 0:
             bufsize = self.default_bufsize
         self.bufsize = bufsize
         self.softspace = False

     def read(self, size=-1):
         # A negative size means "read until EOF"; read(0) returns "".
         if size < 0:
             size = sys.maxint
         blocks = []
         while size > 0:
             b = self._sock.recv(min(size, self.bufsize))
             size -= len(b)
             if not b:
                 break
             blocks.append(b)
         return "".join(blocks)

     def readline(self, size=-1):
         if size < 0:
             size = sys.maxint
         blocks = []
         read_size = min(20, size)
         found = 0
         while size and not found:
             b = self._sock.recv(read_size, MSG_PEEK)
             if not b:
                 break
             found = b.find('\n') + 1
             length = found or len(b)
             size -= length
             blocks.append(self._sock.recv(length))
             read_size = min(read_size * 2, size, self.bufsize)
         return "".join(blocks)

     def write(self, data):
         self._sock.sendall(str(data))

     def writelines(self, lines):
         #  This version mimics the current writelines, which calls
         #  str() on each line, but comments that we should reject
         #  non-string non-buffers.  Let's omit the next line.
         lines = [str(s) for s in lines]
         self._sock.sendall(''.join(lines))

     def flush(self):
         pass
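
For experimenting outside socket.py, the readline() logic above can be pulled out into a standalone function. This is only a sketch under a few assumptions: it targets Python 3 (bytes, sys.maxsize), the name peek_readline and its bufsize default are invented for illustration, and the socket is a blocking stream socket.

```python
import socket
import sys


def peek_readline(sock, size=-1, bufsize=8192):
    """Read one line from sock without consuming bytes past the newline."""
    if size < 0:
        size = sys.maxsize
    blocks = []
    read_size = min(20, size)
    found = 0
    while size and not found:
        # Examine queued bytes without removing them from the OS buffer.
        peeked = sock.recv(read_size, socket.MSG_PEEK)
        if not peeked:
            break                            # peer closed the connection
        found = peeked.find(b"\n") + 1       # 0 when no newline was seen
        length = found or len(peeked)
        size -= length
        # Consume through the newline, or everything that was peeked.
        blocks.append(sock.recv(length))
        read_size = min(read_size * 2, size, bufsize)
    return b"".join(blocks)
```

Each pass through the loop issues two recv() calls, one to peek and one to consume, and the peek window doubles until it reaches bufsize, so long lines cost a logarithmic number of extra calls rather than one call per byte.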



