[scikit-learn] About the Boston housing prices dataset
Christian Lorentzen
lorentzen.ch at gmail.com
Wed Oct 14 06:01:58 EDT 2020
Hi,
As was recently mentioned in PR #18594, the problem with the Boston
housing dataset does not go away just because we remove it from
scikit-learn. On the contrary, it is a valuable dataset for showing and
teaching bias and discrimination (issue #16715 is still waiting for
someone to write an example), in particular because we have access to
the variable "B".
Most, if not all, of the datasets in scikit-learn are available
elsewhere, even in Python. So I don't think availability elsewhere is a
good argument for removal either.
Now that we've removed it from tests and examples, the question for me is:
What more do we want to achieve?
The answers I can think of go down a political road...
I'm fine with Olivier's suggestion
https://github.com/scikit-learn/scikit-learn/pull/18594#issuecomment-707626543.
All the best,
Christian
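P.S. For concreteness, here is a minimal sketch of the deprecation-warning
option discussed in the quoted messages below. Everything in it is
illustrative: the function names, the message wording, and the choice of
FutureWarning are my assumptions, not scikit-learn's actual implementation,
and the OpenML ID 531 is taken from the link Adrin posted (which he
qualified with a "maybe").

```python
import warnings

# Hypothetical wording; not the message scikit-learn actually ships.
BOSTON_DEPRECATION_MESSAGE = (
    "load_boston is deprecated because of ethical concerns about the "
    "dataset (see the variable 'B'). If you still need the data, use "
    "fetch_openml(data_id=531)."
)


def warn_boston_deprecated():
    """Emit the warning a loader could raise before returning the data."""
    # stacklevel=2 attributes the warning to the caller of this helper.
    warnings.warn(BOSTON_DEPRECATION_MESSAGE, FutureWarning, stacklevel=2)


def fetch_boston_from_openml():
    """Download the dataset from OpenML (assumed ID 531, see lead-in)."""
    # Imported lazily so the warning helper works without scikit-learn.
    from sklearn.datasets import fetch_openml

    return fetch_openml(data_id=531, as_frame=True)
```

Calling fetch_boston_from_openml() needs network access and an installed
scikit-learn; the warning helper runs on the standard library alone.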
On 14.10.20 10:34, Adrin wrote:
> Most of those are not talking about the ethical issues of the dataset.
> Let's talk about the alternatives we have:
>
> Keep the loader, but raise a warning:
> - this will result in most people not changing their code/material
> and, IMO, mostly ignoring the warning. Some
> people may see the warning and care about it.
>
> Deprecate, and point them to an alternative dataset, and if they
> really really want the same dataset, point them
> to the openml ID:
> - People will have to change something, and if we give them a nice
> copy/paste-able alternative which is not boston,
> they'll use that instead.
> - Some people will keep using boston from openml, and not care about
> the ethical implications
>
> As an addition, we can keep the load_boston in the docs only, and
> point users to alternatives even after removing
> the loader.
>
> On Wed, Oct 14, 2020 at 10:11 AM Olivier Grisel
> <olivier.grisel at ensta.org> wrote:
>
> On Tue, Oct 13, 2020 at 16:19, Adrin <adrin.jalali at gmail.com>
> wrote:
> >
> > Isn't the Boston dataset available through openml? Maybe here:
> https://www.openml.org/d/531
> >
> > I'm happy to have the dataset out there on openml, and for any
> material that addresses some of the issues with it.
> > But for educational purposes, we don't need to have the dataset
> in the package as long as users can still download it
> > with a one-liner using fetch_openml.
>
> That would be an argument in favor of a deprecation warning with a
> message stating the motivation for the deprecation and pointing to
> fetch_openml.
>
> However, it's going to break examples written in slow-to-update
> tutorials or books once the deprecation period is over. But one could
> argue that this is also the case for any other deprecation in
> scikit-learn. It's just that sklearn.datasets.load_boston is used A
> LOT: https://github.com/search?q=load_boston&type=code
>
> --
> Olivier
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn