[SciPy-dev] Machine learning datasets (was Presentation of pymachine, a python package for machine learning)

Peter Skomoroch peter.skomoroch at gmail.com
Wed May 30 23:24:26 EDT 2007


The licensing of datasets is an interesting issue. It sounds like they will
need to be tackled one by one unless explicitly released to the public
domain.

Check out the Wikipedia entry on "Open Data":

http://en.wikipedia.org/wiki/Open_Data

"Creators of data often do not consider the need to state the conditions of
ownership, licensing and re-use. For example, many scientists do not regard
the published data arising from their work to be theirs to control and the
act of publication in a journal is an implicit release of the data into the
commons. However the lack of a license makes it difficult to determine the
status of a data set and may restrict the use of data offered in an Open
spirit. Because of this
uncertainty it is also possible for public or private organisations to
aggregate such data, protect it with copyright and then resell it."

I remember a while back Leslie Kaelbling bought the Enron dataset
(http://www.cs.cmu.edu/~enron/) for use in machine learning.

Maybe we can start a scipy wiki page with a list/table of datasets along with
their license status, and check off the ones we find are not compatible so
we can find replacements or get permission. Also, we might want to add a
column for which modules use the data (in scipy tests, etc.).
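
A rough sketch of what one row of such a table could hold, in Python. The
field names (license, compatible, used_by) are only guesses for illustration,
not an agreed scipy format, and the statuses are deliberately left unknown
rather than filled in:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetEntry:
    """One row of the proposed table (all field names are hypothetical)."""
    name: str
    source: str
    license: Optional[str] = None      # None means "not yet determined"
    compatible: Optional[bool] = None  # None means "not yet checked"
    used_by: List[str] = field(default_factory=list)  # modules/tests using it


def needs_followup(entries):
    """Return names of datasets whose status is unknown or incompatible.

    These are the ones needing a replacement or explicit permission.
    """
    return [e.name for e in entries if e.compatible is not True]


# Placeholder rows: the names come from this thread, the statuses are unknown.
table = [
    DatasetEntry("AirPassengers", "Box, Jenkins and Reinsel (1976)"),
    DatasetEntry("BJsales", "Box and Jenkins (1976)"),
]

print(needs_followup(table))
```

Treating "unknown" the same as "incompatible" keeps the list conservative:
nothing drops off it until someone has actually checked the license.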

Should I go ahead and create the page?




On 5/30/07, Bruce Southey <bsouthey at gmail.com> wrote:
>
> Hi,
> As an example, AirPassengers is not under the GPL. If you do
> help("AirPassengers") you will see the source:
> " Box, G. E. P., Jenkins, G. M. and Reinsel, G. C. (1976) _Time Series
> Analysis, Forecasting and Control._ Third Edition.  Holden-Day. Series
> G."
>
> Likewise for BJsales, where the help notes were copied from the Time
> Series Data Library
> (http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/). The
> license for this site is free. However, the source provided is:
> "G. E. P. Box and G. M. Jenkins (1976): _Time Series Analysis,
> Forecasting and Control_, Holden-Day, San Francisco, p. 537."
>
> I don't have either book so I cannot tell you if there are any terms
> for use of the dataset. In some cases I presume people would argue
> 'fair use'. Also note that these books predate the GPL (v1 was
> released Feb 1989)!
>
>
> Bruce
>
> On 5/30/07, David Cournapeau <david at ar.media.kyoto-u.ac.jp> wrote:
> > Bruce Southey wrote:
> > > Hi,
> > > You might find the UCI Machine Learning Repository a useful resource for data:
> > > http://www.ics.uci.edu/~mlearn/MLRepository.html
> > >
> > > Standard sources are:
> > > Statlib: http://lib.stat.cmu.edu/
> > > Netlib: http://www.netlib.org/
> > >
> > > Even those included with R may be used, because some are in the public domain.
> > The main problem with datasets seems to be licensing. For example, you say
> > that some of the datasets in R are public domain: do you know which ones
> > (and how do you know? I looked for information on this issue, without any
> > luck). For all I know, the datasets (at least the ones in R core) are
> > under the GPL.
> >
> > cheers,
> >
> > David
> > _______________________________________________
> > Scipy-dev mailing list
> > Scipy-dev at scipy.org
> > http://projects.scipy.org/mailman/listinfo/scipy-dev
> >



-- 
Peter N. Skomoroch
peter.skomoroch at gmail.com
http://www.datawrangling.com