[Pandas-dev] API: Make silent casting behavior consistent by deprecating silent _object_-dtype casting
Brock Mendel
jbrockmendel at gmail.com
Wed Oct 27 00:38:10 EDT 2021
TLDR
----
We have inconsistent silent-casting vs raising logic for numpy vs EA dtypes
(and inconsistencies within EA dtypes). By deprecating silently casting to
*object* dtype, we can *mostly* make the behaviors match.
Background
----------
A number of Series/DataFrame methods will silently cast when dealing with
mismatched values. With a numpy dtype, each of the following silently
casts to float64:
ser = pd.Series([1, 2, 3], dtype="i8")
ser.shift(1, fill_value=1.5)
ser.mask([True, False, False], 1.5)
ser.where([False, True, True], 1.5)
ser.replace(1, 1.5)
ser[0] = 1.5
ser.fillna(1.5) # <- this one doesn't cast as it is a no-op
If we were to pass "foo" or a pd.Period, these would coerce to object
instead of float.
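
To make the float-vs-object distinction concrete, here is a minimal sketch using just one of the methods above (where). The variable names are mine, and the dtypes you see may depend on your pandas version:

```python
import pandas as pd

ser = pd.Series([1, 2, 3], dtype="i8")
keep = [False, True, True]

# A float fill value: the int64 Series is upcast to a float dtype.
to_float = ser.where(keep, 1.5)

# A Period fill value: no common numeric dtype exists, so the
# result falls back to object dtype.
to_object = ser.where(keep, pd.Period("2016-01-01", freq="D"))

print(ser.dtype, "->", to_float.dtype)
print(ser.dtype, "->", to_object.dtype)
```

The second case is the kind of cast this proposal is concerned with.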
By contrast, similar mixed-type operations with an ExtensionDtype Series
_mostly_ raise:
ser2 = pd.Series(pd.period_range("2016-01-01", periods=3, freq="D"))
ser2.shift(1, fill_value=1.5) # <- ValueError
ser2.mask([True, False, False], 1.5) # <- ValueError
ser2.where([False, True, True], 1.5) # <- ValueError
ser2.fillna(1.5) # <- TypeError
ser2.replace(ser2[0], 1.5) # <- coerces to object
ser2[0] = 1.5 # <- coerces to object
ser3 = pd.Series([pd.NA, 2, 3], dtype="Int64")
ser3.shift(1, fill_value=1.5) # <- TypeError
ser3.mask([True, False, False], 1.5) # <- TypeError
ser3.where([False, True, True], 1.5) # <- TypeError
ser3.fillna(1.5) # <- TypeError
ser3.replace(ser3[0], 1.5) # <- TypeError
ser3[0] = 1.5 # <- TypeError
timedelta64, datetime64, and datetime64tz mostly behave like the numpy
dtypes, with a few exceptions:
- shift raises on mismatch
- fillna raises on mismatch for timedelta64, casts for the others
Categorical mostly behaves like other ExtensionDtypes, except for replace,
which has special logic.
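
Since these behaviors differ by dtype and are likely to change between releases, it may help to rerun the comparison on whatever pandas version you have installed. The following is only a sketch: outcome and setitem are hypothetical helpers of mine (not pandas API), it covers a subset of the dtypes and methods discussed above, and it deliberately hardcodes no expected results.

```python
import pandas as pd


# Hypothetical helper (not pandas API): run one operation and report either
# the resulting dtype or the name of the exception it raised.
def outcome(ser, op):
    try:
        return str(op(ser).dtype)
    except Exception as err:  # broad on purpose: this is only a survey
        return type(err).__name__


def setitem(ser):
    out = ser.copy()  # keep the original Series untouched
    out[0] = 1.5
    return out


series = {
    "int64": pd.Series([1, 2, 3], dtype="i8"),
    "period": pd.Series(pd.period_range("2016-01-01", periods=3, freq="D")),
    "Int64": pd.Series([pd.NA, 2, 3], dtype="Int64"),
    "timedelta64": pd.Series(pd.timedelta_range("1 day", periods=3)),
}

ops = {
    "shift": lambda s: s.shift(1, fill_value=1.5),
    "mask": lambda s: s.mask([True, False, False], 1.5),
    "where": lambda s: s.where([False, True, True], 1.5),
    "fillna": lambda s: s.fillna(1.5),
    "setitem": setitem,
}

# One entry per (dtype label, method) pair.
survey = {
    (label, name): outcome(ser, op)
    for label, ser in series.items()
    for name, op in ops.items()
}

for (label, name), result in survey.items():
    print(f"{label:12} {name:8} {result}")
```

Rows that print a dtype name are silent casts (or no-ops); rows that print an exception name are the raising cases.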
Goals
-----
- Have matching behavior across dtypes.
- Share code.
Options
-------
1) Change EA (and dt64/td64) behavior to match non-EA behavior
2) Change non-EA behavior to match EA behavior (or something stricter; xref
https://github.com/pandas-dev/pandas/issues/39584)
3) Deprecate (and eventually raise on) silent casting to _object_ dtype,
allowing silent casting otherwise.
Here I am advocating for option 3). The advantages as I see them:
A) For numpy dtypes, we retain the most useful cases (int->float)
B) Deprecates cases most likely to be unintentional (e.g. typo "2016-01-01"
-> "2p16-01-01" causing a datetime64 Series to silently cast)
C) For td64/dt64/dt64tz/period, the *only* silent casting is to object, so
this completely gets rid of special-casing among that code
D) For IntegerArray, FloatingArray, IntervalArray, leaves open the option of
allowing e.g. Integer->Floating casting (xref
https://github.com/pandas-dev/pandas/issues/25288#issuecomment-941762174)
E) Does not preclude later deciding on the stricter options in 2)
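
As a rough illustration of the rule in option 3), the sketch below wraps an operation and warns only when a non-object Series comes back as object dtype. This is a user-level approximation, not how it would be implemented inside pandas: ObjectCastWarning and checked are hypothetical names, and the real change would live in the casting code itself.

```python
import warnings

import pandas as pd
from pandas.api.types import is_object_dtype


class ObjectCastWarning(FutureWarning):
    """Hypothetical warning category, used only in this sketch."""


def checked(ser, op):
    """Apply op to ser; warn if the result silently became object dtype."""
    result = op(ser)
    if not is_object_dtype(ser.dtype) and is_object_dtype(result.dtype):
        warnings.warn(
            f"silent cast from {ser.dtype} to object dtype",
            ObjectCastWarning,
            stacklevel=2,
        )
    return result


ints = pd.Series([1, 2, 3], dtype="i8")
keep = [False, True, True]
period = pd.Period("2016-01-01", freq="D")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # int -> float: still allowed silently under option 3).
    as_float = checked(ints, lambda s: s.where(keep, 1.5))
    # int -> object: the case that would be deprecated.
    as_object = checked(ints, lambda s: s.where(keep, period))

# Only count our own warning, in case pandas emits unrelated ones.
flagged = [w for w in caught if issubclass(w.category, ObjectCastWarning)]
print(len(flagged), "operation(s) cast to object")
```

After the deprecation period the warning would become an exception, which for td64/dt64/dt64tz/period means every mismatch raises.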