[SciPy-user] Presentation of pymachine, a python package for machine learning

David Cournapeau david at ar.media.kyoto-u.ac.jp
Mon May 14 04:16:17 EDT 2007


Petr Šimon wrote:
> On Monday 14 May 2007 15:44:41 David Cournapeau wrote:
>> Petr Šimon wrote:
>>> On Monday 14 May 2007 14:53:32 Aldarion wrote:
>>>> David Cournapeau wrote:
>>>>> Dear scipy developers and users,
>>>>>    - For people willing to use machine-learning-related software in
>>>>> python/scipy, what are the main requirements/concerns? (e.g. data
>>>>> exploration GUIs, efficiency, readability of the algorithms, etc.)
>>>>>
>>>>>    cheers,
>>>>>
>>>>>    David
>>>> To me, efficiency and readability of the algorithms.
>>>> Orange impressed me,
>>>> but neither Orange nor numpy handles sparse matrices smoothly;
>>>> for example, I don't know how to SVD a large-scale sparse matrix with numpy.
>>> In general, most ML packages like Weka and Orange are great for
>>> small projects, but since they typically load all the data into memory,
>>> you are on your own with larger datasets. This is a major concern
>>> for me.
>> I understand the limitation. I think it is important on the front-end
>> side to have a global mechanism for on-disk data, so that data can be
>> streamed directly from files instead of being loaded entirely into
>> memory. But then there is a problem on the back-end side: most
>> algorithms expect all their input data at once.
>>
>> For example, one of the algorithms that will be supported is
>> Expectation-Maximization (EM) for mixtures of Gaussians. Every
>> iteration of the EM algorithm expects all its data to be available;
>> some extensions make iterative EM possible (one implementation is
>> available in sandbox.pyem, but it is really slow for now, for no good
>> reason other than laziness).
>>
>> Basically, I have not thought a lot about this problem, but I think
>> that, to be useful in general, it needs explicit support from the
>> algorithm itself. The algorithm has to be able to run several times on
>> different parts of the dataset while remembering what it has already
>> computed (I don't know if there is a standard name for this kind of
>> behaviour).
>>
>> What do you have in mind when talking about big problems? What kind of
>> size are we talking about?
>>
> Yes, I know it's not as easy as I would wish :), you did spell it out quite
> well. E.g. I had ca. 14mil 5-D vectors.
Well, if by 14mil you mean 14 million, and if every point is a complex
double, that is, 16 bytes, then it should more or less fit in memory,
no? For memory problems, I see at least two different cases in the
machine learning context:

    - the data used for learning
    - the testing data / the data to classify
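
Before going through these two cases, here is a quick back-of-the-envelope
check on your numbers (I am assuming 5-D vectors stored as doubles or
complex doubles; I don't know the actual layout):

    n_points = 14 * 10**6   # reading "14mil" as 14 million
    n_dims = 5

    mb_double = n_points * n_dims * 8 / 1024.**2    # float64: 8 bytes per value
    mb_cdouble = n_points * n_dims * 16 / 1024.**2  # complex128: 16 bytes per value

    print("float64:    %.0f MB" % mb_double)    # ~534 MB
    print("complex128: %.0f MB" % mb_cdouble)   # ~1068 MB

So roughly half a GB for doubles and about twice that for complex values,
which is consistent with the "more or less fits" guess above.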

The second case is really "just" a question of being careful and of
having a good framework: this is a constraint I will definitely try to
respect. The first case is much more difficult from an algorithmic
point of view, because most learning algorithms do not have good data
locality, at least in direct implementations: they want to see the
whole dataset at each step.
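
To be a bit more concrete about the second case: the "good framework" I
have in mind is essentially the possibility to run a trained model on
the data chunk by chunk, e.g. through a memory-mapped file, which is
also what I meant above by streaming data directly from files. A rough
sketch (the predict() method and the raw float64, 5-column file layout
are made up for the example; none of this is actual pymachine code):

    import numpy as np

    def classify_file(filename, model, chunk_size=100000):
        # Run a trained model over a big dataset stored on disk as raw
        # float64 values, without ever loading the whole file in memory.
        data = np.memmap(filename, dtype=np.float64, mode='r')
        data = data.reshape(-1, 5)           # one 5-D point per row
        labels = []
        for start in range(0, data.shape[0], chunk_size):
            chunk = np.asarray(data[start:start + chunk_size])
            labels.append(model.predict(chunk))
        return np.concatenate(labels)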

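For the first case, the "explicit support from the algorithm" mentioned
above would typically mean accumulating what has already been computed,
i.e. sufficient statistics, while streaming over the data. A rough
illustration for a single Gaussian (again, not pymachine code; the real
thing would do this per mixture component, with weighted data, at every
EM iteration):

    import numpy as np

    class RunningGaussian(object):
        # Accumulate the sufficient statistics of a Gaussian (count, sum,
        # sum of outer products) chunk by chunk, so that the mean and
        # covariance can be computed without holding the data in memory.
        def __init__(self, dim):
            self.n = 0
            self.s = np.zeros(dim)
            self.ss = np.zeros((dim, dim))

        def update(self, chunk):
            chunk = np.asarray(chunk)
            self.n += chunk.shape[0]
            self.s += chunk.sum(axis=0)
            self.ss += np.dot(chunk.T, chunk)

        def mean(self):
            return self.s / float(self.n)

        def covariance(self):
            m = self.mean()
            return self.ss / float(self.n) - np.outer(m, m)

The painful part is that EM needs one such full pass over the dataset
per iteration, which is where the I/O cost really shows up.
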
David


