[Python-3000] revamping the io stack, part 2

tomer filiba tomerfiliba at gmail.com
Sat Apr 29 21:10:15 CEST 2006


i first thought of focusing on the socket module, because it's the part that
bothers me most, but since people have expressed interest in completely
revamping the IO stack, perhaps we should be open to adopting new ideas,
mainly from the java/.NET world (keeping the momentum from the previous
post).

there is an inevitable performance issue here, since this splits what used
to be "file" or "socket" into many layers, each adding additional overhead,
so many parts should be lowered to C.

if we look at java/.NET for guidance, they have come up with two concepts:
* stream - an arbitrary, usually sequential, byte data source
* readers and writers - the way data is encoded into/decoded from the stream.
we'll use the term "codec" for these readers and writers in general.

so "stream" is the "where" and "codec" is the "how", and the concept of
codecs is not limited to ASCII vs UTF-8. it can grow into fully-fledged
protocols.




- - - - - - -
Streams
- - - - - - -

streams provide an interface to data sources, like memory, files, pipes, or
sockets. the basic interface of all of these is

class Stream:
    def close(self)
    def read(self, count)
    def readall(self)
    def write(self, data)

and unlike today's files and sockets, when you read from a broken socket or
past the end of the file, you get EOFError.

read(x) guarantees to return x bytes, or raises EOFError otherwise (also
restoring the stream position). on the other hand, readall() makes no such
guarantee: it reads all the data up to EOF, and if you readall() from EOF,
you get "".

perhaps readall() should return all *available* data, not necessarily up to
EOF. for files, this is equivalent, but for sockets, readall() would return
all the data that sits in the network stack. this could be a nice way to do
non-blocking IO.

and if we do that already, perhaps we should introduce async operations as a
built-in feature? .NET does (BeginRead, EndRead, etc.)
    def async_read(self, count, callback)
    def async_write(self, data, callback)

i'm not sure about these two, but it does seem like a good path to follow.

-----

another issue is the current class hierarchy: fileno, seek, and readline are
meaningless in many situations, yet they are considered part of the file
protocol (take a look at StringIO implementing isatty!).

these methods, which may be meaningless for several types of streams, must
not be part of the base Stream class.

for example, only FileStream and MemoryStream are seekable, so why have seek
as part of the base Stream class?

-----

streams that don't rely on an operating-system resource would derive
directly from Stream. as examples of such streams, we can consider

class MemoryStream(Stream):
    # like today's StringIO
    # allows seeking

class RandomStream(Stream):
    # provider of random data
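
as a sketch of how the read()/readall() contract described above could
behave, here's a hypothetical in-memory implementation (names and details
are illustrative, not part of the proposal):

```python
class MemoryStream:
    # illustrative sketch of the proposed contract, not a real API
    def __init__(self, data=b""):
        self.data = data
        self.pos = 0

    def close(self):
        self.data = b""
        self.pos = 0

    def read(self, count):
        # all-or-nothing: exactly count bytes, or EOFError with the
        # stream position restored (i.e. left untouched)
        if self.pos + count > len(self.data):
            raise EOFError("not enough data")
        chunk = self.data[self.pos:self.pos + count]
        self.pos += count
        return chunk

    def readall(self):
        # no guarantee: whatever is left, b"" at EOF
        chunk = self.data[self.pos:]
        self.pos = len(self.data)
        return chunk

    def write(self, data):
        head = self.data[:self.pos]
        tail = self.data[self.pos + len(data):]
        self.data = head + data + tail
        self.pos += len(data)

s = MemoryStream(b"spam and eggs")
first = s.read(4)            # b"spam"
try:
    s.read(100)              # more than is left: EOFError,
except EOFError:             # and the position stays at 4
    pass
rest = s.readall()           # b" and eggs"
at_eof = s.readall()         # b""
```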

-----

on the other hand, streams that rely on operating-system resources, like
files or sockets, would derive from

class OSStream(Stream):
    def isatty(self)
    def fileno(self) # for select()
    def dup(self)

and there are several examples for this kind:

FileStream is the entity that works with files, instead of the file/open
class of today. since files provide random-access (seek/tell), this kind of
stream is "seekable" and "tellable".

class FileStream(OSStream):
    def __init__(self, filename, mode = "r")
    def seek(self, pos, offset = None)
    def tell(self)
    def set_size(self, size)
    def get_size(self)

although i prefer properties instead
    position = property(tell, seek)
    size = property(get_size, set_size)
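
the property idiom could look like this; FileStream is sketched here over a
plain (unbuffered) python file object purely for illustration, whereas the
proposed FileStream would be implemented in C:

```python
import os
import tempfile

class FileStream:
    # hypothetical sketch: position/size as properties over a real file
    def __init__(self, filename, mode="rb"):
        self.f = open(filename, mode, buffering=0)

    def tell(self):
        return self.f.tell()

    def seek(self, pos):
        self.f.seek(pos)

    def get_size(self):
        return os.fstat(self.f.fileno()).st_size

    def set_size(self, size):
        self.f.truncate(size)

    def write(self, data):
        self.f.write(data)

    def read(self, count):
        pos = self.f.tell()
        data = self.f.read(count)
        if len(data) != count:
            self.f.seek(pos)     # restore position, per the contract
            raise EOFError("not enough data")
        return data

    def close(self):
        self.f.close()

    position = property(tell, seek)
    size = property(get_size, set_size)

fd, path = tempfile.mkstemp()
os.close(fd)
fs = FileStream(path, "r+b")
fs.write(b"hello world")
fs.position = 0              # assignment instead of calling seek()
head = fs.read(5)            # b"hello"
old_size = fs.size           # 11
fs.size = 5                  # truncates the file
new_size = fs.size           # 5
fs.close()
os.remove(path)
```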

PipeStream represents a stream over a (simplex) pipe:

class PipeStream(OSStream):
    def get_mode(self) # read or write

DuplexPipeStream is an abstraction layer that uses two simplex pipes
as a full-duplex stream:

class DuplexPipeStream(OSStream):
    def __init__(self, incoming, outgoing)

    @classmethod
    def open(cls):
        rfd, wfd = os.pipe()
        return cls(PipeStream(rfd), PipeStream(wfd))

NetworkStreams provide a stream over a socket. unlike files, sockets may
get quite complicated (options, accept, bind), so we keep the distinction:
* sockets as the underlying "physical resource"
* NetworkStreams wrap them with a nice stream interface. for example, while
socket.recv(x) may return less than x bytes, networkstream.read(x) returns
x bytes.

we must keep this distinction because streams are *data sources*, and
there's no way to represent things like bind or accept in a data source.
only client (connected) sockets would be wrappable by NetworkStream; server
sockets don't provide data and hence have nothing to do with streams.

class NetworkStream(OSStream):
    def __init__(self, sock)
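
a sketch of how NetworkStream.read could hide recv's short reads, assuming
a connected socket object is passed in (the demo uses socketpair as a
stand-in for a connected TCP socket):

```python
import socket

class NetworkStream:
    # sketch: guarantee read(x) returns exactly x bytes by looping
    # over recv(); a connection closed mid-read raises EOFError
    # (bytes already consumed can't be "restored" on a socket)
    def __init__(self, sock):
        self.sock = sock

    def read(self, count):
        chunks = []
        remaining = count
        while remaining:
            chunk = self.sock.recv(remaining)
            if not chunk:
                raise EOFError("connection closed")
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)

    def write(self, data):
        self.sock.sendall(data)

    def close(self):
        self.sock.close()

a, b = socket.socketpair()
left, right = NetworkStream(a), NetworkStream(b)
left.write(b"hello world")
head = right.read(5)         # exactly 5 bytes, however recv chunks them
rest = right.read(6)
left.close()
right.close()
```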




- - - - - - - - -
Special Streams
- - - - - - - - -

it will also be useful to have a way to duplicate a stream, like the unix
tee command does

class TeeStream(Stream):
    def __init__(self, src_stream, dst_stream)

f1 = FileStream("c:\\blah")
f2 = FileStream("c:\\yaddah")
f1 = TeeStream(f1, f2)

f1.write("hello")

will write "hello" to f2 as well. that's useful for monitoring/debugging,
like echoing everything from a NetworkStream to a file, so you could debug
it easily.
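
TeeStream itself is only a few lines; a sketch, with a tiny write-only
stand-in (Sink, purely for the demo) in place of real streams:

```python
class TeeStream:
    # sketch: every write goes to both streams; reads from the source
    # are echoed to the destination (handy for sniffing a NetworkStream
    # into a log file)
    def __init__(self, src_stream, dst_stream):
        self.src = src_stream
        self.dst = dst_stream

    def write(self, data):
        self.src.write(data)
        self.dst.write(data)

    def read(self, count):
        data = self.src.read(count)
        self.dst.write(data)
        return data

class Sink:
    # tiny write-only stand-in stream for the demo
    def __init__(self):
        self.buf = b""
    def write(self, data):
        self.buf += data

f1, f2 = Sink(), Sink()
tee = TeeStream(f1, f2)
tee.write(b"hello")
# both f1.buf and f2.buf now hold b"hello"
```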

-----

buffering is always *explicit* and implemented at the interpreter level,
rather than by libc, so it is consistent across all platforms and streams.
all streams, by nature, are *non-buffered* (write the data as soon as
possible). buffering wraps an underlying stream, making it explicit

class BufferedStream(Stream):
    def __init__(self, stream, bufsize)
    def flush(self)

(BufferedStream appears in .NET)

class LineBufferedStream(BufferedStream):
    def __init__(self, stream, flush_on = b"\n")

f = LineBufferedStream(FileStream("c:\\blah"))

where flush_on specifies the byte (or sequence of bytes?) to flush upon
writing. by default it would be on newline.
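
a sketch of the flush-on-byte behaviour, again with a minimal Sink
stand-in for the wrapped stream (both names are illustrative):

```python
class LineBufferedStream:
    # sketch: hold written data until the flush_on byte appears,
    # then push everything up to and including the last marker
    def __init__(self, stream, flush_on=b"\n"):
        self.stream = stream
        self.flush_on = flush_on
        self.buf = b""

    def write(self, data):
        self.buf += data
        if self.flush_on in self.buf:
            idx = self.buf.rindex(self.flush_on) + len(self.flush_on)
            self.stream.write(self.buf[:idx])
            self.buf = self.buf[idx:]

    def flush(self):
        self.stream.write(self.buf)
        self.buf = b""

class Sink:
    # tiny write-only stand-in stream for the demo
    def __init__(self):
        self.buf = b""
    def write(self, data):
        self.buf += data

sink = Sink()
f = LineBufferedStream(sink)
f.write(b"hel")
f.write(b"lo\nwor")
flushed_so_far = sink.buf    # b"hello\n" -- only up to the marker
f.write(b"ld")
f.flush()
final = sink.buf             # b"hello\nworld"
```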





- - - - - - -
Codecs
- - - - - - -

as was said earlier, codecs define how data (or arbitrary objects) is
to be encoded into and decoded from a stream.

class StreamCodec:
    def __init__(self, stream)
    def write(self, ...)
    def read(self, ...)

for example, in order to serialize binary records into a file, you would use


class StructCodec(StreamCodec):
    def __init__(self, stream, format):
        StreamCodec.__init__(self, stream)
        self.format = format
    def write(self, *args):
        self.stream.write(struct.pack(self.format, *args))
    def read(self):
        size = struct.calcsize(self.format)
        data = self.stream.read(size)
        return struct.unpack(self.format, data)

(similar to BinaryReader/BinaryWriter in .NET)
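
usage could look like this; since none of the proposed classes exist yet,
a minimal in-memory stand-in plays the role of the stream:

```python
import struct

class MemoryStream:
    # minimal in-memory stand-in for the proposed Stream class
    def __init__(self):
        self.data = b""
        self.pos = 0
    def write(self, data):
        self.data += data
    def read(self, count):
        if self.pos + count > len(self.data):
            raise EOFError("not enough data")
        chunk = self.data[self.pos:self.pos + count]
        self.pos += count
        return chunk

class StreamCodec:
    def __init__(self, stream):
        self.stream = stream

class StructCodec(StreamCodec):
    # as in the proposal: fixed binary records via the struct module
    def __init__(self, stream, format):
        StreamCodec.__init__(self, stream)
        self.format = format
    def write(self, *args):
        self.stream.write(struct.pack(self.format, *args))
    def read(self):
        size = struct.calcsize(self.format)
        return struct.unpack(self.format, self.stream.read(size))

c = StructCodec(MemoryStream(), "<Ld")   # a uint32 and a double
c.write(17, 3.25)
c.write(42, 0.5)
first = c.read()    # (17, 3.25)
second = c.read()   # (42, 0.5)
```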

and for working with text, you would have

class TextCodec(StreamCodec):
    def __init__(self, stream, textcodec = "utf-8"):
        StreamCodec.__init__(self, stream)
        self.textcodec = textcodec
    def write(self, data):
        self.stream.write(data.encode(self.textcodec))
    def read(self, length):
        return self.stream.read(length).decode(self.textcodec)

    def __iter__(self) # iter by lines
    def readline(self) # read the next line
    def writeline(self, data) # write a line

as you can see, only the TextCodec adds the readline/writeline methods, as
they are meaningless to most binary formats. the stream itself has no
notion of a line.

<big drum roll> no more newline issues! </big drum roll>

the TextCodec will do the translation for you. all newlines are \n in
python, and are written to the underlying stream in a way that would please
the underlying platform.

so the "rb" and "wb" file modes will disappear; instead you would wrap the
FileStream with a TextCodec. it's explicit, so you won't be able to corrupt
data accidentally.
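
the translation itself is simple; a sketch of what TextCodec's write/read
could do (MemoryStream is an in-memory stand-in defined inline, and this
sketch counts bytes in read(), where a real implementation would count
characters):

```python
import os

class MemoryStream:
    # minimal in-memory byte stream for the demo
    def __init__(self):
        self.data = b""
        self.pos = 0
    def write(self, data):
        self.data += data
    def read(self, count):
        chunk = self.data[self.pos:self.pos + count]
        self.pos += count
        return chunk

class TextCodec:
    # sketch of the newline translation only
    def __init__(self, stream, textcodec="utf-8"):
        self.stream = stream
        self.textcodec = textcodec

    def write(self, data):
        # programs always say "\n"; the platform convention is
        # applied only at the stream boundary
        self.stream.write(
            data.replace("\n", os.linesep).encode(self.textcodec))

    def read(self, length):
        text = self.stream.read(length).decode(self.textcodec)
        # accept any platform's convention on the way in
        return text.replace("\r\n", "\n").replace("\r", "\n")

m = MemoryStream()
t = TextCodec(m)
t.write("one\ntwo\n")
# m.data holds platform line endings; reading translates them back
text = t.read(len(m.data))   # "one\ntwo\n" on any platform
```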

-----

it's worth noting that in .NET (and perhaps java as well), they split
TextCodec into two parts, the TextReader and TextWriter classes, which you
initialize over a stream:

var f = new FileStream("c:\\blah", FileMode.Open);
var sr = new StreamReader(f, Encoding.UTF8);
var sw = new StreamWriter(f, Encoding.UTF8);
sw.Write("hello");
sw.Flush();
f.Position = 0;
var buf = new char[5];
sr.Read(buf, 0, 5);

but why separate the two? it could only cause problems, as you may
initialize them with different encodings, which leads to no good. under the
guidelines of this suggestion, it would be implemented this way:

f = TextCodec(FileStream("c:\\blah"), "utf-8")

which can of course be refactored to a function:

def textfile(filename, mode = "r", codec = "utf-8"):
    return TextCodec(FileStream(filename, mode), codec)

for line in textfile("c:\\blah"):
    print line

unlike today's file objects, FileStream objects don't know about lines, so
you can't iterate through a file directly. it's quite logical if you think
about it, as there's no meaning to iterating over a binary file by lines.
it's a feature of text files.

-----

many times, especially in network protocols, you need framing for
transferring frames/packets/messages over a stream. so a very useful
FramingCodec can be introduced:

class FramingCodec(StreamCodec):
    def write(self, data):
        self.stream.write(struct.pack("<L", len(data)))
        self.stream.write(data)
    def read(self):
        length, = struct.unpack("<L", self.stream.read(4))
        return self.stream.read(length)

once you set up such a connection, you are free of socket hassle:

conn = FramingCodec(NetworkStream(TcpClientSocket("host", 1234)))
conn.write("hello")
reply = conn.read()

and it can be extended by subclassing, for instance, to allow serializing
streams: you can write objects directly to the stream and get them on the
other side with ease:

class SerializingCodec(FramingCodec):
    def write(self, obj):
        FramingCodec.write(self, pickle.dumps(obj))
    def read(self):
        return pickle.loads(FramingCodec.read(self))

conn = SerializingCodec(NetworkStream(TcpClientSocket("host", 1234)))
conn.write([1,2,{3:4}])
person = conn.read()
print person.first_name

and it can serve as the basis for RPC protocols or as a simple way to
transfer arbitrary objects (for example, database query results from a
server, etc.)

and since the codecs don't care what the underlying stream is, it can be a
FileStream as well, serializing objects to disk.
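
a round-trip sketch of the two codecs over an in-memory stand-in, showing
that the framing/pickling layers really don't care what the stream is:

```python
import pickle
import struct

class MemoryStream:
    # in-memory stand-in; on the wire this would be a NetworkStream
    def __init__(self):
        self.data = b""
        self.pos = 0
    def write(self, data):
        self.data += data
    def read(self, count):
        if self.pos + count > len(self.data):
            raise EOFError("not enough data")
        chunk = self.data[self.pos:self.pos + count]
        self.pos += count
        return chunk

class FramingCodec:
    # length-prefixed frames, as in the proposal
    def __init__(self, stream):
        self.stream = stream
    def write(self, data):
        self.stream.write(struct.pack("<L", len(data)))
        self.stream.write(data)
    def read(self):
        length, = struct.unpack("<L", self.stream.read(4))
        return self.stream.read(length)

class SerializingCodec(FramingCodec):
    # whole pickled objects per frame
    def write(self, obj):
        FramingCodec.write(self, pickle.dumps(obj))
    def read(self):
        return pickle.loads(FramingCodec.read(self))

conn = SerializingCodec(MemoryStream())
conn.write([1, 2, {3: 4}])
obj = conn.read()   # [1, 2, {3: 4}]
```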

-----

many protocols can also be represented as codecs. textual protocols, like
HTTP or SMTP, can be easily implemented that way:

class HttpClientCodec(TextCodec):
    def __init__(self, stream):
        TextCodec.__init__(self, stream, textcodec = "ascii")

    def write(self, request, params, data = ""):
        self.writeline("%s %s" % (request, params))
        self.writeline("")
        if data:
            self.writeline(data)

    def read(self):
        ...
        return response, header, data

    def do_get(self, filename):
        self.write("GET", filename)

    def do_post(self, filename, data):
        self.write("POST", filename, data)

class HttpServerCodec(TextCodec):
    ....

and then http clients and servers become rather simple:

# client
conn = HttpClientCodec(NetworkStream(TcpClientSocket("host", 8080)))
conn.do_get("/index.html")
response, header, data = conn.read()
if response == "200":
    print data

# server
s = TcpServerSocket(("", 8080))
client_sock = s.accept()
conn = HttpServerCodec(NetworkStream(client_sock))
request, params, data = conn.read()

if request == "GET":
    ...

you can write something like urllib in no-time.

-----

it's worth noting that codecs are "stackable", so you can chain them, thus
creating more complex codecs, for instance:

https_conn = HttpClientCodec(SslCodec(NetworkStream(...)))

and other crazy stuff can follow: imagine doing SSL authentication over
pipes, between two processes. why only sockets? yeah, it's crazy, but why
not?





- - - - - - -
Summary
- - - - - - -

to conclude this long post, streams are generic data providers (random,
files, sockets, in-memory), and codecs provide an abstraction layer over
streams, allowing sophisticated use cases (text, binary records, framing,
and even full protocols).

i've implemented some of these ideas in RPyC ( http://rpyc.wikispaces.com ),
in the Stream and Channel modules (i needed a uniform way of working with
pipes and sockets). of course i didn't rewrite the whole io stack there,
but it shows real-life usage of this model.



-tomer