[Python-checkins] GH-77265: Document NaN handling in statistics functions that sort or count (#94676)

stevendaprano webhook-mailer at python.org
Sun Jul 10 03:40:44 EDT 2022


https://github.com/python/cpython/commit/ef61b259e35a0249840184b59f43d8a7f9b095bc
commit: ef61b259e35a0249840184b59f43d8a7f9b095bc
branch: main
author: Raymond Hettinger <rhettinger at users.noreply.github.com>
committer: stevendaprano <steve+python at pearwood.info>
date: 2022-07-10T17:40:27+10:00
summary:

GH-77265: Document NaN handling in statistics functions that sort or count (#94676)

* Document NaN handling in functions that sort or count

* Update Doc/library/statistics.rst

Co-authored-by: Erlend Egeberg Aasland <erlend.aasland at protonmail.com>

* Update Doc/library/statistics.rst

Co-authored-by: Erlend Egeberg Aasland <erlend.aasland at protonmail.com>

* Fix trailing whitespace and rewrap text

Co-authored-by: Erlend Egeberg Aasland <erlend.aasland at protonmail.com>

files:
M Doc/library/statistics.rst

diff --git a/Doc/library/statistics.rst b/Doc/library/statistics.rst
index 347a1be8321e4..5aef6f6f05d63 100644
--- a/Doc/library/statistics.rst
+++ b/Doc/library/statistics.rst
@@ -35,6 +35,35 @@ and implementation-dependent.  If your input data consists of mixed types,
 you may be able to use :func:`map` to ensure a consistent result, for
 example: ``map(float, input_data)``.
 
+Some datasets use ``NaN`` (not a number) values to represent missing data.
+Since NaNs have unusual comparison semantics, they cause surprising or
+undefined behaviors in the statistics functions that sort data or that count
+occurrences.  The functions affected are ``median()``, ``median_low()``,
+``median_high()``, ``median_grouped()``, ``mode()``, ``multimode()``, and
+``quantiles()``.  The ``NaN`` values should be stripped before calling these
+functions::
+
+    >>> from statistics import median
+    >>> from math import isnan
+    >>> from itertools import filterfalse
+
+    >>> data = [20.7, float('NaN'), 19.2, 18.3, float('NaN'), 14.4]
+    >>> sorted(data)  # This has surprising behavior
+    [20.7, nan, 14.4, 18.3, 19.2, nan]
+    >>> median(data)  # This result is unexpected
+    16.35
+
+    >>> sum(map(isnan, data))    # Number of missing values
+    2
+    >>> clean = list(filterfalse(isnan, data))  # Strip NaN values
+    >>> clean
+    [20.7, 19.2, 18.3, 14.4]
+    >>> sorted(clean)  # Sorting now works as expected
+    [14.4, 18.3, 19.2, 20.7]
+    >>> median(clean)       # This result is now well defined
+    18.75
+
+
 Averages and measures of central location
 -----------------------------------------
 
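Editorial sketch (not part of the commit): the added documentation shows the sorting case with median(). The counting case can be illustrated in the same way. The snippet below assumes CPython's usual behavior, where each call to float('nan') creates a distinct object and NaN never compares equal to itself, so a counter treats every NaN as a separate value. The helper name strip_nans is hypothetical and used only for illustration.

```python
from math import isnan
from statistics import multimode


def strip_nans(values):
    """Hypothetical helper: return a list of the values with NaNs removed."""
    return [v for v in values if not isnan(v)]


# NaN compares unequal to everything, including itself.
nan = float('nan')
assert nan != nan
assert not (nan < 1.0) and not (nan > 1.0)

# Two separately created NaNs are counted as two different values,
# so every entry ties for "most common" with a count of one.
data = [float('nan'), float('nan'), 3.0]
modes_raw = multimode(data)

# After stripping the NaNs, the mode is well defined.
modes_clean = multimode(strip_nans(data))

print(len(modes_raw))   # all three entries are reported as modes
print(modes_clean)
```

Filtering with a list comprehension is equivalent here to the filterfalse(isnan, data) form used in the documentation change; either can be applied before calling mode(), multimode(), or quantiles().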


