[SciPy-user] Presentation of pymachine, a python package for machine learning
David Cournapeau
david at ar.media.kyoto-u.ac.jp
Mon May 14 03:44:41 EDT 2007
Petr Šimon wrote:
> On Monday 14 May 2007 14:53:32 Aldarion wrote:
>> David Cournapeau wrote:
>>> Dear scipy developers and users,
>>> - For people willing to use machine learning related software in
>>> python/scipy, what are the main requirements/concern ? (eg Data
>>> exploration GUI, efficiency, readability of the algorithms, etc...)
>>>
>>> cheers,
>>>
>>> David
>> to me, efficiency and readability of the algorithm.
>> and orange impressed me.
>> but neither orange nor numpy handle sparse matrix smoothly,
>> for example,don't know howto SVD a large-scale sparse matrix with numpy.
>>
> In general most of the ML packages like weka and orange are great for small
> projects, but since they typically load all the data into memory, you are on
> your own with larger dataset. This is what I find to be a major concern for
> me.
I understand the limitation. I think it is important on the front-end
side to have a global mechanism for on-disk data, so that data can be
streamed directly from files instead of being loaded entirely into
memory. But then there is a problem on the back-end side: most
algorithms expect all their input data at once.
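
A minimal sketch of the front-end half of this idea, assuming the
dataset is stored as a raw binary file of float64 rows with a known
shape. The name iter_chunks and its parameters are hypothetical and
are not part of pymachine; only numpy.memmap is an existing API.

```python
import numpy as np


def iter_chunks(path, n_rows, n_cols, chunk_rows, dtype=np.float64):
    """Yield successive blocks of rows from a raw binary file.

    Hypothetical helper: the file is assumed to hold n_rows * n_cols
    values of `dtype` in C (row-major) order, with no header.
    """
    # The memory map gives array-like access without reading the whole file.
    data = np.memmap(path, dtype=dtype, mode="r", shape=(n_rows, n_cols))
    for start in range(0, n_rows, chunk_rows):
        # Copy only the current block into ordinary memory.
        yield np.array(data[start:start + chunk_rows])
```

Only one block of chunk_rows rows is resident at a time, so the memory
cost is set by the chunk size rather than by the size of the file.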
For example, one of the algorithms which will be supported is
Expectation Maximization (EM) for mixtures of Gaussians. Every
iteration of the EM algorithm expects all of its data to be available;
some extensions are possible to enable iterative EM algorithms (one
implementation is available in sandbox.pyem, but it is really slow for
now, for no good reason other than laziness).
I have not thought a lot about this problem, but I think that, in
general, it needs explicit support from the algorithm itself to be
useful. The algorithm has to be able to run several times on different
parts of the dataset, while remembering the parts already computed
(I don't know if there is a global name for this kind of behaviour).
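
To make the "remembering already computed parts" idea concrete, here
is a hedged sketch of one EM iteration for a 1-D Gaussian mixture that
visits the data chunk by chunk and keeps only accumulated sufficient
statistics between chunks. It is an illustration under simplifying
assumptions (1-D data, all chunks visited in every iteration), not the
code in sandbox.pyem; the function name em_step_chunked is
hypothetical.

```python
import numpy as np


def em_step_chunked(chunks, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture, computed over chunks.

    `chunks` is any iterable of 1-D arrays; `weights`, `means` and
    `variances` are 1-D arrays with one entry per mixture component.
    """
    k = len(weights)
    s0 = np.zeros(k)  # sum of responsibilities per component
    s1 = np.zeros(k)  # sum of responsibilities * x
    s2 = np.zeros(k)  # sum of responsibilities * x**2
    n = 0
    for x in chunks:
        x = np.asarray(x, dtype=float)
        # E-step on this chunk only: log of weight * Gaussian density.
        diff = x[:, None] - means[None, :]
        log_p = (np.log(weights)
                 - 0.5 * np.log(2.0 * np.pi * variances)
                 - 0.5 * diff ** 2 / variances)
        # Normalise in a numerically stable way to get responsibilities.
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # The only state kept across chunks is these running sums.
        s0 += resp.sum(axis=0)
        s1 += resp.T @ x
        s2 += resp.T @ (x ** 2)
        n += x.size
    # M-step from the accumulated statistics.
    new_weights = s0 / n
    new_means = s1 / s0
    # Floor the variances to guard against round-off going negative.
    new_variances = np.maximum(s2 / s0 - new_means ** 2, 1e-12)
    return new_weights, new_means, new_variances
```

Because the statistics are plain sums, the result does not depend on
how the data is split into chunks. The function has to be called once
per EM iteration with a fresh iterator over the chunks, so the cost
that remains is re-reading the file at every iteration.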
What do you have in mind when talking about big problems? What kind of
dataset size are we talking about?
David