OTish: using short-term TCP connections to send to multiple slaves

Chris Angelico rosuav at gmail.com
Sun Nov 16 10:13:08 EST 2014


On Mon, Nov 17, 2014 at 12:43 AM, Alain Ketterlin
<alain at dpt-info.u-strasbg.fr> wrote:
> jkn <jkn_gg at nicorp.f9.co.uk> writes:
>
>> I have a use case of a single 'master' machine which will need to
>> periodically 'push' data to a variety of 'slave' devices on a small local
>> subnet, over Ethernet. We are talking perhaps a dozen devices in all with
>> comms occurring perhaps once every few seconds, to much less often - once per
>> half an hour, or less. There is probably an upper bound of 64KB or so of
>> data that is likely to be sent on each occasion.
>
> OK, no big requirements, but 64K is still way too much to consider UDP.

I wouldn't say "way too much"; the packet limit for UDP is actually
64KB (minus a few bytes of headers). But UDP for anything more than
your network's MTU is inefficient, plus you'd need to roll your own
acknowledgement system so you know when the client got the data, at
which point you're basically recreating TCP.

That said, UDP would be a good option, if and only if you can live
with those two restrictions - the sender doesn't care if the odd
packet doesn't get through, and no packet may exceed 64KB, with
typical packets ideally closer to 1KB. (DNS servers usually switch
you to TCP if you go above 512 bytes of response, but that's because
DNS responses are usually tiny.) It'd be pretty easy to knock
together a simple UDP system: have the clients listen on some
particular port, and you could even make use of IP broadcast to
send to everyone all at once, since this is a single LAN.
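
Something along these lines would do it (untested sketch; the port
number is an arbitrary placeholder):

import socket

PORT = 5005  # placeholder - pick any free port

# Master: broadcast one datagram to every listener on the subnet.
def broadcast(data):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.sendto(data, ('<broadcast>', PORT))  # 255.255.255.255
    s.close()

# Slave: sit on the port and take whatever arrives. Anything lost
# in transit is simply gone - no retries, no acknowledgements.
def listen():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(('', PORT))
    while True:
        data, addr = s.recvfrom(65535)  # max possible UDP payload
        print('Got %d bytes from %s' % (len(data), addr))

Broadcast means the master doesn't even need a list of the slaves'
addresses - anything listening on the port gets the data.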

>> Previous similar systems have attempted to do this by maintaining multiple
>> long-term TCP connections from the master to all the slave devices. The
>> Master is the server and the slaves periodically check the master to see
>> what has changed. Although this ... works ..., we have had trouble
>> maintaining the connection, for reasons ... I am not yet fully aware
>> of.
>
> This doesn't make much sense to me. On a packet-switched network,
> "maintaining the connection" simply means keeping both ends alive.
> There is nothing "in the middle" that can break the connection
> (unless you use, e.g., ssh tunneling, but then it's a configuration
> problem, or NAT, but I doubt that is your case on a LAN).

NAT is the most common cause of breakage, but any router does have the
power to monitor and drop connections on any basis it likes. (I can't
imagine any reason a non-NAT router would want to prevent connections
from going idle, but it could be done.) It's also possible the
connections are being dropped at a lower level; for instance, a
wireless network might drop a client and force it to reconnect. I've
seen this happen with a number of cheap home wireless routers. Linux
boxes generally hang onto their application-level connections just
fine (assuming DHCP doesn't change the IP address at the same time),
but Windows XP (I haven't checked more recent Windowses) goes and
closes any sockets that were using that connection... without sending
RST packets to the server, of course. So the client knows it has lost
link, but the server doesn't - not until the next time it tries to
send (at which point it *might* get a courteous RST, or it might have
to time out).

So, I wouldn't say it's impossible for the connections to be dying...
but I would say for certain that connection death is diagnosable, and
on a LAN, often quite easily diagnosable.

> Yes but a TCP server is slightly more complex: it has to "accept()"
> connections, which means it blocks waiting for something to come in.
> Or it has to periodically poll its passive socket, but then you have
> a problem with timeouts. Your describing the slaves as "devices"
> makes me think they are simple machines, maybe embedded
> micro-controllers. If that is the case, you may not want to put the
> burden on the slaves.

You should be able to do an asynchronous accept. I don't know the
details of doing that in Python, but I expect asyncio can do it for
you. In Pike, I quite frequently run a single-threaded program that
just drops to a back-end loop, listening for new connections, new
text on the current connections, or any other events. At the C level,
this would be done with select() and friends. So you shouldn't have
to block or poll, though it does take a bit of extra work.
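
The bare-bones version of that back-end loop in Python might look
something like this (untested sketch; the port is a placeholder, and
the selectors module or asyncio would give you a more polished
version of the same idea):

import select, socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(('', 5005))  # placeholder port
listener.listen(5)

sockets = [listener]
while True:
    # Block until *something* is ready - a new connection, or
    # incoming data on an existing one. No busy polling needed.
    readable, _, _ = select.select(sockets, [], [])
    for s in readable:
        if s is listener:
            conn, addr = s.accept()  # won't block: select() said so
            sockets.append(conn)
        else:
            data = s.recv(4096)
            if data:
                pass  # ... handle incoming data from this slave ...
            else:
                sockets.remove(s)  # empty read: client went away
                s.close()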

I wouldn't assume that "devices" are microcontrollers, though. I talk
about full-scale computers that way, if they're the "smaller" end of
the connection. For example, I have a video-playing server which
invokes VLC, and it runs a little HTTP daemon to allow end users to
choose what video gets played. The HTTP clients are usually on various
people's desktop computers, which generally have a lot more grunt than
the little video server does; but I call them "devices" in terms of
the Yosemite Project, because all they are is a source of signals like
"play such-and-such file", "pause", "skip forward", "stop".

> (BTW why not have the slaves periodically connect -- as clients -- to
> the master? No need to keep the connection open, they can disconnect and
> reconnect later. All you need is a way for the master to identify slaves
> across several connections, but an IP address should be enough, no?)

That's an option that works nicely if the slaves know when they've
lost link. Without knowing more about these connection failures, it's
hard to say which end should resolve the issue. In the worst-case
scenario, all transmission is in one direction (server to clients),
and packets for "old" connections simply start getting dropped;
possible, unlikely, and very annoying. In that case, the clients will
have no idea that they need to reconnect - the server will,
eventually, because its data packets aren't getting acknowledged, but
the clients just assume all's well. On the other hand, the example of
Windows XP and wifi is one where the clients know virtually instantly
that they've been cut off, but the server doesn't; if that's what's
happening, then definitely they should be the ones to reconnect (and
the server should probably close off the previous connection from that
IP).
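
If the clients are the ones reconnecting, the slave side is only a
few lines (untested sketch; the master's address and port are
placeholders):

import socket, time

MASTER = ('192.168.1.1', 5005)  # placeholder address and port

while True:
    try:
        with socket.create_connection(MASTER, timeout=10) as sock:
            sock.settimeout(None)  # connected; now block on recv
            while True:
                data = sock.recv(4096)
                if not data:
                    break  # master closed the connection
                # ... act on the pushed data here ...
    except OSError:
        pass  # connect failed, or the link died mid-read
    time.sleep(5)  # brief back-off before reconnecting

A power cycle or an unplugged cable just drops the slave back to the
top of the loop.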

>> I should also add that we desire our system to be 'robust' in the face of
>> situations such as cable unplugging, device power cycles, etc.
>
> Then avoid connection-oriented protocols like TCP. Maybe you should
> consider SCTP, which is a message-oriented, reliable protocol. There
> is a pysctp module on PyPI. (Haven't tried it, I've just made a
> quick search on SCTP in Python.)
>
> Or roll your own protocol on top of UDP, but that's another story.

Not sure why TCP shouldn't be used here. Depending on the definition
of 'robust', TCP might be anywhere from "overkill but doesn't hurt" to
"perfectly doing what you need". All you need is a little layer around
it to cope with power cycling the device (something like "program
starts on device boot, connects to server, server maybe kicks off the
other connection") and everything else will be handled with return
codes. You do need to do everything asynchronously, though - either
non-blocking I/O or threads - because otherwise you'll sit there
waiting for timeouts whenever some client disconnects.
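
For the "server maybe kicks off the other connection" part, a thread
per client plus a per-IP registry is about the simplest layer that
does the job (untested sketch; the port is a placeholder):

import socket, threading

clients = {}  # peer IP -> socket
lock = threading.Lock()

def handle(conn, addr):
    ip = addr[0]
    with lock:
        old = clients.get(ip)
        if old:
            old.close()  # newest connection from a device wins
        clients[ip] = conn
    # ... per-connection send/receive loop goes here ...

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(('', 5005))
listener.listen(5)
while True:
    conn, addr = listener.accept()
    threading.Thread(target=handle, args=(conn, addr)).start()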

OP, more information needed:
1) What should happen when a client is unavailable for a while? Should
messages be queued, or dropped?
2) If a message occasionally fails to get through, what are the
consequences? Are subsequent messages meaningless, or can the
receiver just move on?
3) What usually happens when a connection breaks? Which end is first
to find out?

ChrisA


