[pypy-dev] pre-emptive micro-threads utilizing shared memory message passing?

Kevin Ar18 kevinar18 at hotmail.com
Fri Aug 6 04:30:27 CEST 2010


Note: Gabriel, do you think we should discuss this on another mailing list (or in private), as I'm not sure this is related to PyPy dev anymore?
 
 
Anyway, what are your future plans for the project?
Is it just an experiment for school ... maybe in the hope that others would maintain it if it was found to be interesting?
Or are you planning actual future development, maintenance, and promotion of it yourself?


-----------

On a personal note... the concept has a lot of similarities to what I am exploring. However, I would have to make quite a few additional modifications. Perhaps you can share some thoughts on how long it would take me to add such things?

Some examples:

* Two additional message passing styles (in addition to your own)
Queues - multiple tasklets can push onto a queue, but only one tasklet can pop from it. Multiple tasklets can access a property to find out whether there is any data in the queue. Queues can be set to an infinite size or given a maximum number of entries.

Streams - I'm not sure of the exact name, but something like an infinite stream/buffer ... useful for passing arbitrary amounts of data. Only one tasklet can write/add data, and only one tasklet can read/extract data.
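To make the two styles concrete, here is a minimal sketch of the semantics I have in mind. The class and method names (Queue, Stream, push, pop, has_data, write, read) are purely illustrative, not an existing Stackless or PyPy API, and the sketch ignores locking and blocking entirely:

from collections import deque

class Queue:
    """Many tasklets may push; only one designated tasklet may pop."""
    def __init__(self, maxsize=None):        # maxsize=None means unbounded
        self._items = deque()
        self.maxsize = maxsize

    def push(self, item):
        if self.maxsize is not None and len(self._items) >= self.maxsize:
            raise RuntimeError("queue is full")
        self._items.append(item)

    def pop(self):
        return self._items.popleft()

    @property
    def has_data(self):                       # any tasklet may inspect this
        return bool(self._items)

class Stream:
    """Unbounded buffer: exactly one writer tasklet, exactly one reader."""
    def __init__(self):
        self._buffer = deque()

    def write(self, data):                    # the single writer adds data
        self._buffer.append(data)

    def read(self):                           # the single reader extracts data
        return self._buffer.popleft()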


* Message passing
When you create a tasklet, you assign a set of queues or streams to it (it can have many) and specify, for each one, whether the tasklet extracts data from it or writes to it (as noted above, it can only do one or the other for any given queue or stream). The tasklet's global namespace has access to these queues or streams and can extract data from them or add data to them.

In my case, I look at message passing from the perspective of the tasklet. A tasklet can be assigned a certain number of "in ports" and a certain number of "out ports." In this case, the "in ports" are the .read() end of a queue or stream and the "out ports" are the .send() end of a queue or stream.
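As a rough illustration of the port idea (reusing the hypothetical Queue class from the sketch above; Task, in_ports and out_ports are made-up names, not an existing API):

class Task:
    def __init__(self, func, in_ports, out_ports):
        self.func = func              # the tasklet body
        self.in_ports = in_ports      # ports this tasklet may only read from
        self.out_ports = out_ports    # ports this tasklet may only write to

    def run(self):
        # The body only ever sees its own ports, not shared global state.
        self.func(self.in_ports, self.out_ports)

def double(in_ports, out_ports):
    value = in_ports[0].pop()         # read from the single in port
    out_ports[0].push(value * 2)      # write the result to the out port

inbox, outbox = Queue(), Queue()
task = Task(double, in_ports=[inbox], out_ports=[outbox])
inbox.push(21)
task.run()
print(outbox.pop())                   # -> 42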


* Scheduler
For the scheduler, I would need to control when a tasklet runs. Currently, I am thinking that the scheduler would look at all the "in ports" a tasklet has and make sure each one has some data; only then would the tasklet be scheduled to run.
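A minimal sketch of that rule, building on the hypothetical Task/Queue classes above (again, an illustration only, not an actual scheduler):

def runnable(task):
    # A task is ready only when every one of its in ports has data waiting.
    # (A Stream-backed port would need a similar has_data property.)
    return all(port.has_data for port in task.in_ports)

def scheduler_pass(tasks):
    """One pass of the scheduler: run every task whose inputs are all ready."""
    for task in tasks:
        if runnable(task):
            task.run()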



------------
On another note, I am curious how you handled the issue of "nested" objects. Consider the send() and receive() that you use to pass objects around in your project. Am I correct that these objects cannot contain references to anything outside of themselves? Also, how do you handle extracting an object out of the tree and making sure there are no references outside the object?

For example, consider the following object graph, where "->" means the object holds a reference to the other object:

Object 1 -> Object 2

Object 2 -> Object 3
Object 2 -> Object 4

Object 4 -> Object 2


Now, let's say I have a tasklet like the following:

.... -> incoming data = pointer/reference to Object 1

1. read incoming data (get Object 1 reference)
2. remove Object 3
3. send Object 3 to tasklet B
4. send Object 1 to tasklet C

Result:
tasklet B now has this object:
pointer/reference to Object 3, which contains the following tree:
Object 3


tasklet C now has this object:
pointer/reference to Object 1, which contains the following tree:
Object 1 -> Object 2
Object 2 -> Object 4
Object 4 -> Object 2



On the other hand, consider the following scenario:
 
1. read incoming data (get Object 1 reference)
2. remove Object 4
ERROR: this would not be possible, as Object 4 refers back to Object 2, which remains part of Object 1's tree.
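To spell out the rule I have in mind (this is only my own illustration of the check, not how your implementation works): detaching a node is safe only when no reference crosses the boundary between the detached part and the objects that stay behind. A toy version over the example graph, where graph[A] lists the objects A refers to:

graph = {              # the "A -> B" edges from the example above
    1: [2],
    2: [3, 4],
    3: [],
    4: [2],
}

def reachable(root, exclude=frozenset()):
    """All nodes reachable from root without passing through `exclude`."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen or node in exclude:
            continue
        seen.add(node)
        stack.extend(graph[node])
    return seen

def can_detach(node, root=1):
    subtree = reachable(node)                 # everything the node drags along
    rest = reachable(root, exclude=subtree)   # what stays behind
    # A reference from the detached part out to the rest...
    out_refs = any(t in rest for n in subtree for t in graph[n])
    # ...or from the rest into the detached part (other than the single
    # link being severed by the remove) makes the detach unsafe.
    in_refs = any(t in subtree and t != node for n in rest for t in graph[n])
    return not (out_refs or in_refs)

print(can_detach(3))   # True  - Object 3 is self-contained
print(can_detach(4))   # False - Object 4 drags Object 2 along, and Object 1
                       #         still refers to Object 2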
 

> Sorry for the late answer, I was unavailable in the last few days.
>
> About send() and receive(), it depends on if the communication is local
> or not. For a local communication, anything can be passed since only
> the reference is sent. This is the base model for Stackless channels.
> For a remote communication (between two interpreters), any picklable
> object can be sent (a copy will then be made), and that includes channels
> and tasklets (for which a reference will automatically be created).
>
> The use of the PyPy proxy object space is to make remote communication
> more Stackless-like by passing objects by reference. If a ref_object is
> made, only a reference will be passed when a tasklet is moved or the
> object is sent on a channel. The object always resides where it was
> created. A move() operation will also be implemented on those objects
> so they can be moved around like tasklets.
>
> I hope it helps,
>
> Gabriel
>
> 2010/7/29 Kevin Ar18>
>
>> Hello Kevin,
>> I don't know if it can be a solution to your problem, but for my
>> Master's thesis I'm working on making Stackless Python distributed. What
>> I did is working but not complete, and I'm right now in the process of
>> writing the thesis (in French, unfortunately). My code currently works
>> with PyPy's "stackless" module only and uses some PyPy-specific
>> things. Here's what I added to Stackless:
>>
>> - Possibility to move tasklets easily (ref_tasklet.move(node_id)). A
>> node is an instance of an interpreter.
>> - Each tasklet has its own global namespace (to avoid sharing of data). The
>> state is also easier to move to another interpreter this way.
>> - Distributed channels: All requests are known by all nodes using the
>> channel.
>> - Distributed objects: When a reference is sent to a remote node, the
>> object is not copied; a reference is created using PyPy's proxy object
>> space.
>> - Automated dependency recovery when an object or a tasklet is loaded
>> on another interpreter
>>
>> With a proper scheduler, many tasklets could be automatically spread in
>> multiple interpreters to use multiple cores or on multiple computers. A
>> bit like the N:M threading model where N lightweight threads/coroutines
>> can be executed on M threads.
>
> Was able to have a look at the API...
> If others don't mind my asking this on the mailing list:
>
> * .send() and .receive()
> What type of data can you send and receive between the tasklets? Can
> you pass entire Python objects?
>
> * .send() and .receive() memory model
> When you send data between tasklets (pass messages), or whatever you want
> to call it, how is this implemented under the hood? Does it use shared
> memory or does it involve a more costly copying of the
> data? I realize that if it is on another machine you have to copy the
> data, but what about between two threads? You mentioned PyPy's proxy
> object.... guess I'll need to read up on that.
> _______________________________________________
> pypy-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/pypy-dev
>
>
>
> --
> Gabriel Lavoie
> glavoie at gmail.com

