[scikit-learn] Replacing the Boston Housing Prices dataset

G Reina greina at eng.ucsd.edu
Thu Jul 6 12:41:19 EDT 2017


Wow. I completely disagree.

The fact that many tutorials and examples rely on it is not a reason to
keep the dataset. New tutorials are written all the time. And, as sklearn
evolves, some of the existing tutorials will need to be updated anyway to
keep up with the changes.

Using "ethnicity" to make business decisions is completely illegal in
the United States. For example, credit scoring systems bend over backward
to expunge even proxy features that could be highly correlated with race
(for example, they can't include neighborhood, but can include entire
counties).

Let's leave the studying of racism to actual scientists who study racism.
Not to toy datasets that we use to teach our students about a completely
unrelated matter like regression.

-Tony


On Thu, Jul 6, 2017 at 9:31 AM, Andreas Mueller <t3kcit at gmail.com> wrote:

> Hi Tony.
>
> I don't think it's a good idea to remove the dataset, given how many
> tutorials and examples rely on it.
> I also don't think it's a good idea to ignore racial discrimination, which
> I guess this feature is trying to capture.
>
> I was recently asked to remove an excerpt from a dataset from my slide, as
> it was "too racist". It was randomly sampled
> data from the adult census dataset. Unfortunately, economics in the US are
> not color blind (yet), and the reality is racist.
> I haven't done an in-depth analysis on whether this feature is actually
> informative, but I don't think your analysis is conclusive.
>
> Including ethnicity in data actually allows us to ensure "fairness" in
> certain decision making processes.
> Without collecting this data, it would be impossible to ensure automatic
> decisions are not influenced
> by past human biases. Arguably that's not what the authors of this dataset
> are doing.
>
> Check out http://www.fatml.org/ for more on fairness in machine learning
> and data science.
>
> Cheers,
> Andy
>
>
>
> On 07/06/2017 12:05 PM, G Reina wrote:
>
> I'd like to request that the "Boston Housing Prices" dataset in sklearn
> (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices"
> dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am
> willing to submit the code change if the developers agree.
>
> The Boston dataset has the feature "Bk is the proportion of blacks in
> town". It is an incredibly racist "feature" to include in any dataset. I
> think it is beneath us as data scientists.
>
> I submit that the Ames dataset is a viable alternative for learning
> regression. The author has shown that the dataset is a more robust
> replacement for Boston. Ames is a 2011 regression dataset on housing prices
> and has more than 5 times as many training examples and over 7 times
> as many features (none of which are morally questionable).
>
> I welcome the community's thoughts on the matter.
>
> Thanks.
> -Tony
>
> Here's an article I wrote on the Boston dataset:
> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn