[SciPy-user] Presentation of pymachine, a python package for machine learning

David Cournapeau david at ar.media.kyoto-u.ac.jp
Mon May 14 03:44:41 EDT 2007


Petr Šimon wrote:
> On Monday 14 May 2007 14:53:32 Aldarion wrote:
>> David Cournapeau wrote:
>>> Dear scipy developers and users,
>>>    - For people willing to use machine learning related software in
>>> python/scipy, what are the main requirements/concerns? (e.g. data
>>> exploration GUI, efficiency, readability of the algorithms, etc.)
>>>
>>>    cheers,
>>>
>>>    David
>> to me, efficiency and readability of the algorithm.
>> and orange impressed me.
>> but neither orange nor numpy handles sparse matrices smoothly,
>> for example, I don't know how to SVD a large-scale sparse matrix with numpy.
>>
> In general, most of the ML packages like weka and orange are great for small 
> projects, but since they typically load all the data into memory, you are on 
> your own with larger datasets. This is what I find to be a major concern for 
> me.
I understand the limitation. I think it is important on the front-end 
side to have a global mechanism for on-disk data, so that data can be 
streamed directly from files instead of being loaded entirely into 
memory. But then there is a problem on the back-end side: most 
algorithms expect all their input data at once.
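
A minimal sketch of the front-end idea, assuming the data sits in a raw
binary file of float64 rows. numpy.memmap is a real numpy facility; the
helper names (iter_chunks, streamed_column_sums), the file name and the
shapes are made up for illustration and are not part of pymachine.

```python
import os
import tempfile

import numpy as np


def iter_chunks(path, n_rows, n_cols, chunk_rows, dtype=np.float64):
    """Yield successive blocks of rows from a raw binary file on disk.

    Hypothetical helper: only the rows of the current block are copied
    into memory, the rest of the file stays on disk.
    """
    data = np.memmap(path, dtype=dtype, mode="r", shape=(n_rows, n_cols))
    for start in range(0, n_rows, chunk_rows):
        # np.array(...) copies just this slice out of the memory map
        yield np.array(data[start:start + chunk_rows])


def streamed_column_sums(path, n_rows, n_cols, chunk_rows):
    """Column sums computed one block at a time."""
    total = np.zeros(n_cols)
    for block in iter_chunks(path, n_rows, n_cols, chunk_rows):
        total += block.sum(axis=0)
    return total


def demo():
    # Made-up data written to a temporary file, then read back in blocks.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(1000, 3))
    path = os.path.join(tempfile.mkdtemp(), "data.bin")
    x.tofile(path)
    return x.sum(axis=0), streamed_column_sums(path, 1000, 3, chunk_rows=128)
```

A column sum is easy because it is additive over blocks; the back-end
problem is that many algorithms are not written as such a reduction.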

For example, one of the algorithms that will be supported is Expectation 
Maximization for mixtures of Gaussians. Every iteration of the EM 
algorithm expects all of its data to be available; some extensions 
make iterative EM algorithms possible (one implementation is 
available in sandbox.pyem, but it is really slow for now, for no good 
reason other than laziness).
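
One way to relax the "all data at once" requirement, sketched under
assumptions: for a Gaussian mixture, the E step only contributes sums
(sufficient statistics) that can be accumulated chunk by chunk, and the
M step needs nothing but those sums. The code below is an illustration
with diagonal covariances. It is not the sandbox.pyem implementation and
not the iterative variant mentioned above, and every function name in it
is hypothetical.

```python
import numpy as np


def log_gauss_diag(x, mu, var):
    """Log density of each row of x under each diagonal Gaussian.

    x: (n, d); mu and var: (k, d). Returns an (n, k) array.
    """
    diff = x[:, None, :] - mu[None, :, :]
    return -0.5 * (np.log(2 * np.pi * var).sum(axis=1)[None, :]
                   + (diff ** 2 / var[None, :, :]).sum(axis=2))


def em_step_chunked(chunks, w, mu, var):
    """One EM iteration; the E step visits the data one chunk at a time."""
    k, d = mu.shape
    s0 = np.zeros(k)        # sum of responsibilities per component
    s1 = np.zeros((k, d))   # responsibility-weighted sum of x
    s2 = np.zeros((k, d))   # responsibility-weighted sum of x**2
    for x in chunks:
        logp = np.log(w)[None, :] + log_gauss_diag(x, mu, var)
        logp -= logp.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        s0 += resp.sum(axis=0)
        s1 += resp.T @ x
        s2 += resp.T @ (x ** 2)
    # M step: uses only the accumulated statistics, never the raw data
    new_w = s0 / s0.sum()
    new_mu = s1 / s0[:, None]
    new_var = s2 / s0[:, None] - new_mu ** 2
    return new_w, new_mu, new_var


def demo():
    # Made-up two-cluster data; compare one pass over the full array
    # with one pass over the same rows split into 7 chunks.
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2.0, 1.0, size=(500, 2)),
                        rng.normal(3.0, 1.0, size=(500, 2))])
    rng.shuffle(x)
    w = np.array([0.5, 0.5])
    mu = np.array([[-1.0, -1.0], [1.0, 1.0]])
    var = np.ones((2, 2))
    batch = em_step_chunked([x], w, mu, var)
    chunked = em_step_chunked(np.array_split(x, 7), w, mu, var)
    return batch, chunked
```

Each iteration still reads the whole dataset once, but only one chunk
has to be in memory at a time, so the chunks could come from disk.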

Basically, I have not thought a lot about this problem, but I think that, 
in general, it needs explicit support from the algorithm itself to be 
useful. The algorithm has to be able to run several times on 
different parts of the dataset, while remembering what it has already 
computed (I don't know if there is a general name for this kind of 
behaviour).
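
As an illustration of that kind of behaviour, here is a small sketch of
an object that keeps state between calls: a running mean and covariance
that can be updated with one part of the dataset at a time, using the
standard pairwise merge of means and sums of squared deviations. The
class name RunningMoments and its update method are hypothetical and
only show the shape such explicit support could take.

```python
import numpy as np


class RunningMoments:
    """Mean and covariance that can be updated one chunk at a time."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = np.zeros(n_features)
        # running sum of outer products of deviations from the mean
        self.m2 = np.zeros((n_features, n_features))

    def update(self, chunk):
        """Fold a (rows, n_features) chunk into the remembered state."""
        chunk = np.asarray(chunk, dtype=float)
        m = chunk.shape[0]
        if m == 0:
            return self
        c_mean = chunk.mean(axis=0)
        dev = chunk - c_mean
        delta = c_mean - self.mean
        total = self.n + m
        # merge the chunk's statistics with what was already computed
        self.m2 += dev.T @ dev + np.outer(delta, delta) * self.n * m / total
        self.mean += delta * m / total
        self.n = total
        return self

    @property
    def cov(self):
        """Unbiased covariance estimate of everything seen so far."""
        return self.m2 / (self.n - 1)


def demo():
    # Made-up data, fed to the object in 9 separate parts.
    rng = np.random.default_rng(1)
    x = rng.normal(size=(600, 4))
    stats = RunningMoments(4)
    for chunk in np.array_split(x, 9):
        stats.update(chunk)
    return x, stats
```

The result should match the mean and covariance computed on the whole
array, although the object never sees more than one chunk at once.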

What do you have in mind when talking about big problems? What kind of 
size are we talking about?

David


