[SciPy-dev] Starting a datasets package, again

Wed Jun 6 01:33:17 EDT 2007

David Cournapeau wrote:
> Robert Kern wrote:
>> The iris and oldfaithful packages you posted earlier were good. We might want to
>> fiddle with the metadata later, but what you had is probably sufficient.
> Those data were from r-base, and I thought that following our discussion 
> on licensing, it would have been better not to use them.

Sorry, I meant the form, not necessarily the content.

> The thing I really like about the datasets in R is that any package can
> depend on them for demos/examples/etc. I don't know much about
> easy_install yet, but does the dependency tracking system work well?

It works fine, provided that all of the relevant packages are registered on
PyPI and are installed as eggs (or with egg metadata) on users' machines.

> For example, if you install foo, which uses faithful in some examples, when
> is the dependency resolved?

I would recommend that the examples' dependencies be listed as an "extras"
dependency. The setup() for, say, scikits.pyem would have these arguments:

  ...
  install_requires = ['numpy'],
  extras_require = {
    'examples': ['scipydata.iris', 'scipydata.oldfaithful'],
  },
  ...

Then, if you want to be able to run the examples for scikits.pyem, you would do
this:

  $ easy_install "scikits.pyem[examples]"

However, just running

  $ easy_install scikits.pyem

won't install the data packages (this is a good thing).
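
To make the split concrete, here is a minimal sketch of how those arguments fit
together. The version string is a placeholder, and requirements_for() is a
hypothetical helper, not a setuptools API; it only mimics, in simplified form,
which requirements get requested with and without the "examples" extra.

```python
# Sketch only: keyword arguments that a setup.py for scikits.pyem might pass
# to setuptools.setup(). The version string is a placeholder assumption.
SETUP_KWARGS = dict(
    name="scikits.pyem",
    version="0.0.0",  # placeholder, not a real release number
    install_requires=["numpy"],
    extras_require={
        "examples": ["scipydata.iris", "scipydata.oldfaithful"],
    },
)


def requirements_for(kwargs, extras=()):
    """Return the requirements requested for the given extras.

    Hypothetical helper for illustration; not part of setuptools.
    """
    # The base requirements are always requested.
    reqs = list(kwargs.get("install_requires", []))
    # Each named extra adds its own list on top of the base requirements.
    for extra in extras:
        reqs.extend(kwargs["extras_require"][extra])
    return reqs


# In a real setup.py you would finish with:
#     from setuptools import setup
#     setup(**SETUP_KWARGS)
```

Calling requirements_for(SETUP_KWARGS) gives only the base requirement, while
requirements_for(SETUP_KWARGS, ["examples"]) also pulls in the two data
packages.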

> Would it be ok to use them in tests ?

I would like to avoid that. Just include the data that you need in the code or
in a file included with the package. If you need lots of data, though, you're
writing the wrong kind of test, IMO.
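
As a rough illustration of keeping test data in the code, here is a minimal
sketch. The mean() function stands in for whatever is actually under test, and
the sample values are made up for the example; neither comes from a real
package.

```python
import unittest


def mean(values):
    """Hypothetical function under test: arithmetic mean of a sequence."""
    values = list(values)
    if not values:
        raise ValueError("mean() of empty sequence")
    return sum(values) / float(len(values))


class TestMean(unittest.TestCase):
    # A handful of made-up values embedded in the test itself, so the test
    # does not depend on any separately installed data package.
    SAMPLE = [1.0, 2.0, 3.0, 4.0]

    def test_mean(self):
        self.assertAlmostEqual(mean(self.SAMPLE), 2.5)

    def test_empty(self):
        self.assertRaises(ValueError, mean, [])
```

A test like this runs anywhere the package itself is installed, with no extra
dependencies to resolve.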

> For the old faithful data, the answer I received from Prof. Azzalini (whose
> article "A look at some data on the old faithful geyser" has the original
> data) is that an acknowledgment would be welcomed, so if we acknowledge
> it in the sources, is it OK to apply BSD? (The problem would be if people
> using it were required to acknowledge as well, right?)

If anyone has the right to make this decision, it would be him and his coauthor.

I've just taken a look at the "Open Data" Wikipedia article that Peter Skomoroch
linked in the last discussion, something I should have done earlier. From it, I
found a link to Science Commons, a branch of the Creative Commons project.

  http://sciencecommons.org

Sadly, they do not have a pre-made license that we could simply suggest to
authors we approach. As the SC FAQ explains, the database protection is not, in
fact, copyright but a similar kind (_sui generis_ in lawyer-speak) of right
carved out by the EU Database Directive (and similar laws implemented by member
and non-member states). The Creative Commons licenses (with some
nationally-specific exceptions) only operate on copyrighted works, not
almost-but-not-quite-copyrighted works. <sigh>.

However, and this is the good bit, that right expires after 15 years.

  http://ec.europa.eu/archives/ISPO/legal/en/ipr/database/text.html#HD_NM_14

Of course, we will give the appropriate citation (no law compels us to, but we
shouldn't need laws to compel us to do this little). We should also include the
author's request for acknowledgment. I think it would be nice to state that we
think the data is in the public domain given the above reasoning, since this is
one time that I think we can nail down something concrete in a very fuzzy area.

We should write our own descriptive text instead of using that from the R
package; that text *does* fall under the copyright of whoever wrote it. And this
raises another fuzzy issue: the copyright/_sui generis_ right of the data is
different from the copyright of the surrounding text and code. There is
necessarily going to be some confusion, I think.

If you want a declaration from me, I would say that the surrounding text and
code in scipydata packages should always be under the BSD license. This should
be noted using the "License :: OSI Approved :: BSD License" classifier in the
setup script and in a *comment* in the code following the copyright notice.
However, the copyright notice and license should be accompanied by a note saying
that the data does not fall under this license or copyright, and pointing to the
metadata that records the status of the data. I'm not good at legal boilerplate,
but something like the following would be fine, I think:

# The code and descriptive text are copyrighted and offered under the terms of
# the BSD License from the authors; see below. However, the actual dataset may
# have a different origin and intellectual property status. See the SOURCE and
# COPYRIGHT variables for this information.
#
# Copyright (c) 2007 Enthought, Inc.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# ..., etc.

David, thank you for pursuing this with the care that you have, and thank you
for bearing with my long-winded pontificating while you do all of the actual
work.  :-)

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco