[Python-3000] iostack and sock2

tomer filiba tomerfiliba at gmail.com
Sun Jun 4 21:45:24 CEST 2006


you certainly have good points there.

i'll start with the easy ones:
>Some things that don't appear to have been considered in the iostack
design yet:
> - non-blocking IO and timeouts (e.g. on NetworkStreams)

NetworkStreams have a readavail() method, which reads all the available
in-queue data, as well as may_read and may_write properties.

besides, because of the complexity of sockets (so many different
options, protocols, etc), i'd leave the timeout to the socket itself.
i.e.

s = TcpSocket(...)
s.timeout = 2
ns = NetworkStream(s)
ns.read(100)

> - interaction with (replacement of?) the select module

well, it's too hard to design for a nonexistent module. select is all there
is that's platform independent.

random idea:
* select is virtually platform independent
* improved polling is inconsistent
    * kqueue is BSD-only
    * epoll is linux-only
    * windows has none of those

maybe introduce a new select module that has select-objects, like
the Poll() class, that will default to using select(), but could use
kqueue/epoll when possible?

s = Select((sock1, "r"), (sock2, "rw"), (sock3, "x"))
res = s.wait(timeout = 1)
for sock, events in res:
    ....
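to make the idea concrete, here's a minimal sketch of such a select-object,
built directly on the stdlib select module. the Select/wait names and the
(sock, "r"/"rw"/"x") spec format follow the snippet above; everything else
(the internal set layout, returning (sock, events) pairs) is an assumption.
a real version would dispatch to epoll/kqueue when available instead of
always calling select():

```python
import select

class Select:
    """a sketch of the proposed select-object; always uses the portable
    select.select(), but could dispatch to epoll/kqueue when possible"""

    _SETS = {"r": 0, "w": 1, "x": 2}

    def __init__(self, *specs):
        # specs are (sock, "r"/"w"/"rw"/"x") pairs, as in the example above
        self.sets = ([], [], [])
        for sock, events in specs:
            for ev in events:
                self.sets[self._SETS[ev]].append(sock)

    def wait(self, timeout=None):
        r, w, x = select.select(self.sets[0], self.sets[1],
                                self.sets[2], timeout)
        # collect the fired events per socket
        results = {}
        for sock in r:
            results.setdefault(sock, set()).add("r")
        for sock in w:
            results.setdefault(sock, set()).add("w")
        for sock in x:
            results.setdefault(sock, set()).add("x")
        return list(results.items())
```

usage would then look exactly like the snippet above: iterate over the
(sock, events) pairs that wait() returns.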

- - - - -

> The common Stream API should include a flush() write method, so that
> application code doesn't need to care whether or not it is dealing with
> buffered IO when forcing output to be displayed.

i object. it would soon lead to things like today's StringIO, which defines
isatty and flush even though they're completely meaningless there. having to
implement functions "just because" is ugly.

i would suggest a different approach -- PseudoLayers. these are
mockup layers that provide a do-nothing function only for interface
consistency. each layer would define its own pseudo-layer, for
example:

class BufferingLayer(Layer):
    def flush(self):
       <implementation>

class PseudoBufferingLayer(Layer):
    def flush(self):
       pass

when you pass an unbuffered stream to a function that expects
it to be buffered (requires flush, etc), you would just wrap it
with the pseudo-layer. this would allow arbitrary mockup APIs
to be defined by users (why should flush be that special?)
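here's a runnable sketch of how that composes. the BufferingLayer and
PseudoBufferingLayer names come from the fragment above; the minimal Layer
base class and the dump() consumer are assumptions invented for illustration:

```python
class Layer:
    """minimal base: a layer wraps a substream and delegates to it
    (assumed shape, not the real iostack Layer)"""
    def __init__(self, substream):
        self.substream = substream
    def write(self, data):
        self.substream.write(data)

class BufferingLayer(Layer):
    def __init__(self, substream, bufsize=8192):
        Layer.__init__(self, substream)
        self.bufsize = bufsize
        self._buf = []
        self._size = 0
    def write(self, data):
        self._buf.append(data)
        self._size += len(data)
        if self._size >= self.bufsize:
            self.flush()
    def flush(self):
        # push everything buffered so far down to the substream
        self.substream.write(b"".join(self._buf))
        self._buf = []
        self._size = 0

class PseudoBufferingLayer(Layer):
    """do-nothing flush(), purely for interface consistency"""
    def flush(self):
        pass

def dump(stream, data):
    # a consumer that expects a buffered stream: it calls flush()
    stream.write(data)
    stream.flush()
```

so code like dump(), written against the buffered interface, works unchanged
on an unbuffered stream once you wrap it in the pseudo-layer.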

- - - - -

> e.g an alternative approach would be to
> define InputStream and OutputStream, and then have an IOStream that inherited
> from both of them).

hrrm... i need to think about this more. one problem i already see:

class InputStream:
    def close(self): ...
    def read(self, count): ...

class OutputStream:
    def close(self): ...
    def write(self, data): ...

class NetworkStream(InputStream, OutputStream):
    ...

which version of close() gets called?
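for the record, python's MRO plus cooperative super() gives a standard answer
to this: both close() methods run, each once, in MRO order. a sketch (the
calls list and the empty resource-release bodies are stand-ins for real
stream cleanup):

```python
calls = []

class StreamBase:
    def close(self):
        calls.append("base")    # end of the cooperative chain

class InputStream(StreamBase):
    def close(self):
        calls.append("input")   # release input-side resources
        super().close()         # continue up the MRO

class OutputStream(StreamBase):
    def close(self):
        calls.append("output")  # release output-side resources
        super().close()

class NetworkStream(InputStream, OutputStream):
    pass

NetworkStream().close()
# calls == ["input", "output", "base"]
```

the catch, of course, is that this only works if every class in the hierarchy
is written cooperatively; a single non-super()-calling close() breaks the chain.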

- - - - -

> e.g. the 'position' property is
> probably a bad idea, because x.position may then raise an IOError

i guess it's a reasonable approach, but i'm a "usability beats purity" guy.
f.position = 0
or
f.position += 10

is so much more convenient than seek()ing and tell()ing. we can also
optimize += by defining a Position type whose __iadd__(n) uses
seek(n, "curr") instead of seek(n + tell(), "start")
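here's a sketch of that optimization. the Position name and the seek/tell
vocabulary come from the post; the int-subclass wiring and the SeekableStream
wrapper are assumptions for illustration:

```python
class Position(int):
    """an int that knows its stream, so `pos += n` becomes a single
    relative seek instead of tell() plus an absolute seek"""
    def __new__(cls, value, stream):
        self = int.__new__(cls, value)
        self.stream = stream
        return self
    def __iadd__(self, n):
        self.stream.seek(n, 1)   # 1 == seek relative to "curr"
        return Position(int(self) + n, self.stream)

class SeekableStream:
    """hypothetical stream exposing the .position property, wrapping any
    file-like object with seek()/tell()"""
    def __init__(self, fileobj):
        self.fileobj = fileobj
    def read(self, n):
        return self.fileobj.read(n)
    def _get_pos(self):
        return Position(self.fileobj.tell(), self.fileobj)
    def _set_pos(self, n):
        try:
            self.fileobj.seek(n)
        except IOError:
            raise ValueError("invalid position value", n)
    position = property(_get_pos, _set_pos)
```

note that `f.position += 4` still ends with the property setter re-seeking to
the position we're already at; a harmless no-op, but it shows the descriptor
machinery isn't entirely free.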

btw, you can first test the "seekable" attribute, to see if positioning
would work.

and in the worst case, i'd vote for converting IOErrors to ValueErrors...

def _set_pos(self, n):
    try:
        self.seek(n)
    except IOError:
        raise ValueError("invalid position value", n)

so that
f.position = -10
raises a ValueError, which is logical

- - - - -

> The stream layer hierarchy needs to be limited to layers that both expose and
> use the normal bytes-based Stream API. A separate stream interface concept is
> needed for something that can be used by the application, but cannot have
> other layers stacked on top of it.

yeah, i wanted to do so myself, but couldn't find a good definition of what
is stackable and what's not. but i like the idea. i'll think some more
about that as well.

> The BytesInterface differs from a normal low-level
> stream primarily in the fact that it *is* line-iterable.

but what's a line in a binary file? how does that make sense? binary files
are usually made of records, headers, pointers, arrays of records (tables)...
think of what an ELF32 binary, a database, or a core dump looks like -- those
are binary files. what would a "line" mean to a .tar.bz2 file?

- - - - -

> Additionally, the 'textfile' helper tries to handle line
> terminators while the data is still bytes, while Unicode defines line endings
> in terms of characters. As I understand it, "\x0A" (CR), "\x0D" (LF),
> [...]

well, currently, the TextLayer reads the stream character by character,
until it finds "\n"... the specific encoding of "\n" depends on the
layer's encoding, but i don't deal with all the weird cases you mentioned.

- - - - -

random idea:
when compiled with universal line support, python unicode should
equate "\n" to any of the aforementioned characters.
i.e.

u"\n" == u"\u2028" # True

the fact that unicode is stupid shouldn't make programming unicode
as stupid: a newline is a newline!

but then again, it could be solved with an isnewline(ch) function
instead, without messing with the internals of the unicode type...
so that's clearly (-1). i just wrote it "for the record".
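such an isnewline(ch) function is tiny; here's a sketch using exactly the
terminator set nick listed (LF, CR, NEL, FF, LS, PS -- CRLF being a two-character
sequence, not a single terminator):

```python
# the unicode line terminators nick listed: LF, CR, NEL, FF, LS, PS
_LINE_TERMINATORS = frozenset("\x0a\x0d\x85\x0c\u2028\u2029")

def isnewline(ch):
    """True if the single character `ch` is a unicode line terminator"""
    return ch in _LINE_TERMINATORS
```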

- - - - -

> I can see that behaviour being seriously annoying when you get to the end of
> the stream. I'd far prefer for the stream to just give me the last bit when I
> ask for it and then tell me *next* time that there isn't anything left.

well, today it's done like so:

while True:
   x = f.read(100)
   if not x:
      break

in iostack, that would be done like so:

try:
    while True:
        x = f.read(100)
except EOFError:
    last_x = f.readall() # read all the leftovers (0 <= leftovers < 100)

a little longer, but not illogical

> If you want a method with the other behaviour, add a "readexact" API, rather
> than changing the semantics of "read" (although I'd be really curious to hear
> the use case for the other behaviour).

well, when i work with files/sockets, i tend to send data structures over them,
like records, frames, protocols, etc. if a record is said to be x bytes long,
and read(x) returns less than x bytes, my code has to loop until it gets
enough bytes.

for example, a record-codec:

import struct

class RecordCodec:
    def __init__(self, substream, format):
        self.substream = substream
        self.format = format

    def read(self):
        raw = self.substream.read(struct.calcsize(self.format))
        return struct.unpack(self.format, raw)

if substream.read() returns less than the expected number of bytes,
as is the case with sockets, the RecordCodec would have to perform
its own buffering... and that happens in so many places today.
imho, any framework must follow the DRY principle... i wish i could
expand this acronym, but then i'd repeat myself ;)
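that repeated buffering is exactly this loop, which every record-reading
caller writes today. read_exact is a hypothetical helper name; the point is
that read(n)-returns-exactly-n would fold this loop into the stream layer once:

```python
def read_exact(stream, n):
    """keep calling read() until exactly n bytes have arrived,
    or raise EOFError if the stream ends early"""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(remaining)   # may return fewer bytes than asked
        if not chunk:
            raise EOFError("stream ended with %d bytes missing" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```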

since the normal use-case for read(n) is expecting n bytes, read(n)
is the standard API, while readany(n) can be used for unknown lengths.
and when your IO library will be packed with useful things like
FramingLayer, or SerializingLayer, you would just use such frames or
whatever to transfer arbitrary lengths of data, without thinking twice.
it would just become the natural way of doing that.

imagine how cool it could be -- SerializingLayer could mean the end of
specialized protocols and state machines. you just send an object that
can take care of itself (a ChatMessage would have a .show() method,
etc.).

- - - - -

and you still have readany

>>> my_netstream.readany(100)
"hello"

perhaps it should be renamed readupto(n)

as for code that interacts with ugly protocols like HTTP, you could use:

s = TextInterface(my_netstream, "ascii")
header = []
for line in s:
    if not line:
        break
    header.append(line)

- - - - -

thanks for the ideas.


-tomer

On 6/3/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> tomer filiba wrote:
> > hi all
> >
> > some time ago i wrote this huge post about stackable IO and the
> > need for a new socket module. i've made some progress with
> > those, and i'd like to receive feedback.
> >
> > * a working alpha version of the new socket module (sock2) is
> > available for testing and tweaking with at
> > http://sebulba.wikispaces.com/project+sock2
> >
> > * i'm working on a version of iostack... but i don't expect to make
> > a public release until mid july. in the meanwhile, i started a wiki
> > page on my site for it (motivation, plans, design):
> > http://sebulba.wikispaces.com/project+iostack
>
> Nice, very nice.
>
> Some things that don't appear to have been considered in the iostack design yet:
>   - non-blocking IO and timeouts (e.g. on NetworkStreams)
>   - interaction with (replacement of?) the select module
>
> Some other random thoughts about the current writeup:
>
> The design appears to implicitly assume that it is best to treat all streams
> as IO streams, and raise an exception if an output operation is accessed on an
> input-only stream (or vice versa). This seems like a reasonable idea to me,
> but it should be mentioned explicitly (e.g an alternative approach would be to
> define InputStream and OutputStream, and then have an IOStream that inherited
> from both of them).
>
> The common Stream API should include a flush() write method, so that
> application code doesn't need to care whether or not it is dealing with
> buffered IO when forcing output to be displayed.
>
> Any operations that may touch the filesystem or network shouldn't be
> properties - attribute access should never raise IOError (this is a guideline
> that came out of the Path discussion). (e.g. the 'position' property is
> probably a bad idea, because x.position may then raise an IOError)
>
> The stream layer hierarchy needs to be limited to layers that both expose and
> use the normal bytes-based Stream API. A separate stream interface concept is
> needed for something that can be used by the application, but cannot have
> other layers stacked on top of it. Additionally, any "bytes-in-bytes-out"
> transformation operation can be handled as a single codec layer that accepts
> an encoding function and a decoding function. This can then be used for
> compression layers, encryption layers, Golay encoding, A-law companding, AV
> codecs, etc. . .
>
>    StreamLayer
>      * ForwardingLayer - forwards all data written or read to another stream
>      * BufferingLayer - buffers data using given buffer size
>      * CodecLayer - encodes data written, decodes data read
>
>    StreamInterface
>      * TextInterface - text oriented interface to a stream
>      * BytesInterface - byte oriented interface to a stream
>      * RecordInterface - record (struct) oriented interface to a stream
>      * ObjectInterface - object (pickle) oriented interface to a stream
>
> The key point about the stream interfaces is that while they will provide a
> common mechanism for getting at the underlying stream, their interfaces are
> otherwise unconstrained. The BytesInterface differs from a normal low-level
> stream primarily in the fact that it *is* line-iterable.
>
> On the topic of line buffering, the Python 2.x IO stack treats binary files as
> line iterable, using '\n' as a line separator (well, more strictly it's a
> record separator, since we're talking about binary files).
>
> There's actually an RFE on SF somewhere about making the record separator
> configurable in the 2.x IO stack (I raised the tracker item ages ago when
> someone else made the suggestion).
>
> However, the streams produced by iostack's 'file' helper are not currently
> line-iterable. Additionally, the 'textfile' helper tries to handle line
> terminators while the data is still bytes, while Unicode defines line endings
> in terms of characters. As I understand it, "\x0A" (CR), "\x0D" (LF),
> "\x0A\x0D" (CRLF), "\x85" (NEL), "\x0C" (FF), "\u2028" (LS), "\u2029" (PS)
> should all be treated as line terminators as far as Unicode is concerned.
>
> So I think line buffering and making things line iterable should be left to
> the TextInterface and BytesInterface layers. TextInterface would be most
> similar to the currently file interface, only working on Unicode strings
> instead of 8-bit strings (as well as using the Unicode definition of what
> constitutes a line ending). BytesInterface would work with binary files,
> returning a bytes object for each record.
>
> So I'd tweak the helper functions to look like:
>
> def file(filename, mode = "r", bufsize = -1, line_sep="\n"):
>      f = FileStream(filename, mode)
>      # a bufsize of 0 or None means unbuffered
>      if bufsize:
>          f = BufferingLayer(f, bufsize)
>      # Use bytes interface to make file line-iterable
>      return BytesInterface(f, line_sep)
>
> def textfile(filename, mode = "r", bufsize = -1, encoding = None):
>      f = FileStream(filename, mode)
>      # a bufsize of 0 or None means unbuffered
>      if bufsize:
>          f = BufferingLayer(f, bufsize)
>      # Text interface deals with line terminators correctly
>      return TextInterface(f, encoding)
>
> > with lots of pretty-formatted info. i remember people saying
> > that stating `read(n)` returns exactly `n` bytes is problematic,
> > can you elaborate?
>
> I can see that behaviour being seriously annoying when you get to the end of
> the stream. I'd far prefer for the stream to just give me the last bit when I
> ask for it and then tell me *next* time that there isn't anything left. This
> has worked well for a long time with the existing read method of file objects.
> If you want a method with the other behaviour, add a "readexact" API, rather
> than changing the semantics of "read" (although I'd be really curious to hear
> the use case for the other behaviour).
>
> (Take a look at the s3.recv(100) line in your Sock2 example - how irritating
> would it be for that to raise EOFError because you only got a few bytes?)
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
> ---------------------------------------------------------------
>              http://www.boredomandlaziness.org
>

