[summerofcode] Application sent

ChunWei Ho fuzzybr80 at gmail.com
Tue Jun 7 06:13:39 CEST 2005


On 6/4/05, Ian Bicking <ianb at colorstudy.com> wrote:

> I think it's good -- one thing that open source has taught me is that
> ideas aren't very valuable when compared to implementation (code or
> otherwise).  And transparency is good.

That is certainly true, in retrospect. I've learnt something already :)

> 
> > Project Title: Data Serving/Collection Framework in Python/WSGI
> >
> > Proposed Mentor/Sponsoring Organization: Python Software Foundation
> >
> > Project Description:
> > A framework based on bulk data serving/collection via the internet.
> > Bulk data are in the form of files that could easily be
> > several-several hundred MB (not surveys or simple POST data).
> >
> > The client has a file repository that it wishes to sync to the server
> > (a WSGI application). This server should be able to facilitate
> > transfer via a number of protocols, including HTTP file transfer, HTTP
> > form upload, FTP, Email.
> >
> > This project is aimed not at yet another ad-hoc file transfer or p2p
> > file-sharing program but as a persistent production setup for
> > transferring data from data collection sites/areas to a server,
> > possibly via internet through different methods to get through strict
> > organizational firewalls and web admins.
> >
> > Unlike a normal straightforward file transfer application, the
> > framework should support:
> > + Authentication and encryption
> > + Verification scheme for data transfer, retries, etc - MD5 hash compare?
> > + Chunking of large files and reassembly on receipt
> > + Partial/Resume file transfers support - may depend on nature of data
> 
> This can also be part of the file transfer app, using HTTP range support.
> 
> Each piece that can be implemented in a generic way will be easier to
> decouple, test, and implement.  And HTTP has a lot of possible
> functionality that's worth implementing directly.  For instance, etags
> are similar in function to hashes, and there's a standard header for
> giving the hash of a body (I don't think it gets much use, though,
> because TCP/IP is reliable enough).  Even encryption can be done in
> terms of SSL with client certificates (though that might be difficult,
> as SSL happens at a level that is sometimes hard to get access to,
> depending on your server).

Good idea :) Some portions of the support functions, like chunking of
large files, may need additional support outside of the carrying
protocol too, so the application layer will keep track of these apart
from the actual transfer of individual chunks by the session layer
(HTTP/SSL).
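To check my own understanding, here is a rough sketch of that
application-layer bookkeeping: split a file into chunks with an MD5
hash per chunk, so each piece can be verified (and retried)
independently of the HTTP/SSL transfer. The helper names and the
per-chunk MD5 scheme are just my illustration of the comparison idea
above, not a fixed design:

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; a real setup might use several MB


def split_with_hashes(data, chunk_size=CHUNK_SIZE):
    """Split a byte string into (offset, piece, md5) tuples so each
    chunk can be verified and retried independently."""
    chunks = []
    for offset in range(0, len(data), chunk_size):
        piece = data[offset:offset + chunk_size]
        chunks.append((offset, piece, hashlib.md5(piece).hexdigest()))
    return chunks


def reassemble(chunks, expected_md5):
    """Verify each chunk's hash, then the whole file's hash, on receipt."""
    body = b""
    for offset, piece, digest in sorted(chunks):
        if hashlib.md5(piece).hexdigest() != digest:
            raise ValueError("chunk at offset %d failed verification" % offset)
        body += piece
    if hashlib.md5(body).hexdigest() != expected_md5:
        raise ValueError("reassembled file failed verification")
    return body
```

The per-chunk hashes also make partial/resume transfers cheap: the
receiver only re-requests the offsets whose digests do not match.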

> > Also, unlike commercial advanced file transfer programs, the framework:
> > + Supports multiple protocols for transfer (HTTP/FTP/Email)
> > + Automatic identification of files to synchronize (comparison of
> > server and client repositories and request automatically)
> > + Conditional Processing (triggers - resync file if modified? logic -
> > user specified)
> > + Robust and considerate client - may be shared machine, means a
> > service (I initially designed it for Windows clients - platform choice
> > was not up to me) that must be configurable on when it runs, how long
> > it runs
> > + and if configured limit does not allow client to sync all data -
> > what must be synced first (Latest file first, Earliest file first,
> > Latest file only, etc). This form of consideration seems to be
> > important for running on production sites or factory machines when the
> > machine is in use in the day but idle for our use at night, or when
> > machines have internet connectivity (possibly dialup) at only certain
> > times of the day.
> 
> How do you see it as different from rsync?  If it's not that different,
> that's not so bad -- derivative perhaps, but rsync is very popular and
> useful, and you can do a lot worse than copy a useful piece of software.
>   If the pieces that are used to implement it are decoupled, then that
> leaves yourself or other people room to recombine the pieces in novel
> ways, while at the same time copying something useful means you'll have
> a set of pieces that have proven utility.
> 
> This will be especially true to the degree you utilize HTTP's potential.

rsync is a great idea! I guess the idea of submitting only diffs for
existing files escaped me, since my original idea dealt primarily
with new files. Most of the support functions, like timing and
conditional processing, may not be fully supported by rsync, but they
could certainly be set up in combination with other *nix server tools.

In that sense this project is derivative. The main impetus for this is
to transfer data in a secure/safe manner over HTTP/FTP/Email - going
over the usual site firewalls/proxies as allowed traffic, instead of
direct connections through them.

And many of the other support functions, on the server or client
side, can be used elsewhere as well.
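For instance, the "what must be synced first" policy from the proposal
could be one such decoupled piece. A small sketch (the function name,
the policy names, and the byte-budget model are my own illustration):

```python
def select_files(files, policy, budget_bytes):
    """Pick which files to sync when the configured window doesn't
    allow syncing everything.  `files` is a list of
    (name, mtime, size) tuples; `budget_bytes` is how much we may
    transfer before the window closes."""
    if policy == "latest_first":
        ordered = sorted(files, key=lambda f: f[1], reverse=True)
    elif policy == "earliest_first":
        ordered = sorted(files, key=lambda f: f[1])
    elif policy == "latest_only":
        ordered = sorted(files, key=lambda f: f[1], reverse=True)[:1]
    else:
        raise ValueError("unknown policy: %r" % policy)
    picked, used = [], 0
    for name, mtime, size in ordered:
        if used + size <= budget_bytes:
            picked.append(name)
            used += size
    return picked
```

Because the selection logic takes plain tuples, the same piece could
drive an HTTP, FTP, or Email transport without change.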

> > Development will be based on WSGI/Paste model, although I will also
> > investigate Zope/Cherry/Plone and other frameworks purely for
> > comparison or design consideration purposes. WSGI is chosen for small
> > learning curve, as well as the fact that data collection for an
> > application can be separated from other functions.
> 
> I think the benefit of WSGI here -- and I think it is considerable -- is
> that it is low-level enough that you don't have to work around places
> where the framework isn't intended for how you are using it.  This is
> especially true of large file support and more advanced HTTP
> functionality (like ranges and etags and that sort of thing).  Lots of
> frameworks are notably bad at large files in particular.
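A quick sketch to check my understanding of how directly WSGI exposes
this: a minimal app serving a single blob, streaming it in fixed-size
chunks and honoring a simple single-range request. Serving from memory
(rather than a real file) and handling only the `bytes=start-end` form
are my own simplifications:

```python
def make_file_app(data, chunk_size=8192):
    """Return a WSGI app that serves `data`, streaming it in chunks
    and honoring a simple single Range header (bytes=start-end)."""
    def app(environ, start_response):
        start, end = 0, len(data) - 1
        range_header = environ.get("HTTP_RANGE", "")
        if range_header.startswith("bytes="):
            lo, _, hi = range_header[len("bytes="):].partition("-")
            if lo:
                start = int(lo)
            if hi:
                end = int(hi)
            status = "206 Partial Content"
        else:
            status = "200 OK"
        body = data[start:end + 1]
        start_response(status, [
            ("Content-Length", str(len(body))),
            ("Content-Range", "bytes %d-%d/%d" % (start, end, len(data))),
        ])
        # Stream chunk by chunk instead of returning one huge string,
        # so large files never sit whole in memory.
        return (body[i:i + chunk_size]
                for i in range(0, len(body), chunk_size))
    return app
```

The point is that nothing here fights the framework: the Range header
and the chunked response iterator are handled exactly where the app
wants them.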

One additional query here... It seems that some of the advantages
provided by WSGI/Paste come from the richness of HTTP support. Maybe
I'm shooting myself in the foot here :) but which approach do you
think is better in this case for the file serving application (for
example, if I am thinking of using FTP):

FTP translated to HTTP: The remote client sends in FTP, which the app
translates into HTTP actions for WSGI. It takes the response and does
what is necessary to reply in FTP. The client is only aware that the
remote server is an FTP server. This means that encryption and
verification support has to be managed outside of the file transfer
protocol.

FTP encapsulated by HTTP: The remote client/server sends FTP data
encapsulated in HTTP, which allows it to bring in support functions
like hash headers, ranges, and etags (I think SSL can work over
direct FTP). But of course the client has to be capable of reading
and stripping the HTTP layer. In this case, won't FTP become
redundant (HTTP data transfer encapsulating FTP data transfer)?

I appreciate your insights. :)

Chun Wei
