[scikit-learn] About the Boston housing prices dataset

Wed Oct 14 06:01:58 EDT 2020

Hi

As was recently mentioned in PR #18594, the problem with the Boston 
housing dataset does not go away just because we remove it from 
scikit-learn. On the contrary, it is a valuable dataset for showing and 
teaching bias and discrimination (issue #16715 is still waiting for 
someone to write an example), in particular because we have access to 
the variable "B".

Most, if not all, of the datasets in scikit-learn are available 
elsewhere, even in Python. So I don't think availability elsewhere is a 
good argument for removal either.

As we've now removed it from tests and examples, the question for me is: 
what more do we want to achieve?
The answers I can think of go down a political road...

I'm fine with Olivier's suggestion 
https://github.com/scikit-learn/scikit-learn/pull/18594#issuecomment-707626543.
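
Whatever we settle on, the deprecation-warning route discussed in the 
quoted thread below could look roughly like the following. This is only 
a sketch: the names load_boston_sketch and deprecated, and the wording 
of the message, are hypothetical and not scikit-learn's actual 
implementation. It uses FutureWarning because Python shows that category 
to end users by default. The OpenML ID 531 is the one from the link 
Adrin posted.

```python
import functools
import warnings

# Hypothetical wording; the real message would be decided in the PR.
BOSTON_MESSAGE = (
    "load_boston is deprecated because of ethical concerns about the "
    "dataset. If you really need the same data, use "
    "fetch_openml(data_id=531)."
)


def deprecated(message):
    """Return a decorator that emits a FutureWarning on every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller's line.
            warnings.warn(message, FutureWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated(BOSTON_MESSAGE)
def load_boston_sketch():
    """Stand-in for the real loader; returns placeholder data only."""
    return {"data": [], "target": []}
```

Existing code keeps working during the deprecation period, and every 
call surfaces the motivation and the alternative.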

All the best,
Christian

On 14.10.20 10:34, Adrin wrote:
> Most of those are not talking about the ethical issues of the dataset. 
> Let's talk about the alternatives we have:
>
> Keep the loader, but raise a warning:
> - this will result in most people not changing their code/material 
> and, IMO, mostly ignoring the warning. Some
> people may see the warning and care about it.
>
> Deprecate, and point them to an alternative dataset, and if they 
> really really want the same dataset, point them
> to the openml ID:
> - People will have to change something, and if we give them a nice 
> copy/paste-able alternative which is not Boston,
> they'll use that instead.
> - Some people will keep using Boston from openml, and not care about 
> the ethical implications.
>
> In addition, we can keep load_boston in the docs only, and 
> point users to alternatives even after removing
> the loader.
>
> On Wed, Oct 14, 2020 at 10:11 AM Olivier Grisel 
> <olivier.grisel at ensta.org> wrote:
>
>     On Tue, Oct 13, 2020 at 16:19, Adrin <adrin.jalali at gmail.com>
>     wrote:
>     >
>     > Isn't the Boston dataset available through openml? Maybe here:
>     > https://www.openml.org/d/531
>     >
>     > I'm happy to have the dataset out there on openml, and for any
>     > material that addresses some of the issues with it.
>     > But for educational purposes, we don't need to have the dataset
>     > in the package as long as users can still download it
>     > with a one-liner using fetch_openml.
>
>     That would be an argument in favor of a deprecation warning with a
>     message stating the motivation for deprecation and pointing to
>     fetch_openml.
>
>     However, it's going to break examples written in slow-to-update
>     tutorials or books once the deprecation period is over. But one could
>     argue that this is also the case for any other deprecation in
>     scikit-learn. It's just that sklearn.datasets.load_boston is used A
>     LOT: https://github.com/search?q=load_boston&type=code
>
>     -- 
>     Olivier
>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org
>     https://mail.python.org/mailman/listinfo/scikit-learn