[Pandas-dev] API: Make silent casting behavior consistent by deprecating silent _object_-dtype casting

Brock Mendel jbrockmendel at gmail.com
Wed Oct 27 00:38:10 EDT 2021


TLDR
----
We have inconsistent silent-casting vs raising logic for numpy vs EA dtypes
(and inconsistencies within EA dtypes).  By deprecating silently casting to
*object* dtype, we can *mostly* make the behaviors match.


Background
----------
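A number of Series/DataFrame methods silently cast when given a value that
does not fit the existing dtype.  With a numpy dtype, each of the following
silently casts to float64 (except the fillna call, which is a no-op here
because the Series has no missing values):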

    ser = pd.Series([1, 2, 3], dtype="i8")

    ser.shift(1, fill_value=1.5)
    ser.mask([True, False, False], 1.5)
    ser.where([False, True, True], 1.5)
    ser.replace(1, 1.5)
    ser[0] = 1.5
    ser.fillna(1.5)  # <- this one doesn't cast as it is a no-op

If we were to pass "foo" or a pd.Period, these would coerce to object
instead of float.

By contrast, similar mixed-type operations with an ExtensionDtype Series
_mostly_ raise:

    ser2 = pd.Series(pd.period_range("2016-01-01", periods=3, freq="D"))

    ser2.shift(1, fill_value=1.5)         # <- ValueError
    ser2.mask([True, False, False], 1.5)  # <- ValueError
    ser2.where([False, True, True], 1.5)  # <- ValueError
    ser2.fillna(1.5)                      # <- TypeError
    ser2.replace(ser2[0], 1.5)            # <- coerces to object
    ser2[0] = 1.5                         # <- coerces to object

    ser3 = pd.Series([pd.NA, 2, 3], dtype="Int64")

    ser3.shift(1, fill_value=1.5)         # <- TypeError
    ser3.mask([True, False, False], 1.5)  # <- TypeError
    ser3.where([False, True, True], 1.5)  # <- TypeError
    ser3.fillna(1.5)                      # <- TypeError
    ser3.replace(ser3[0], 1.5)            # <- TypeError
    ser3[0] = 1.5                         # <- TypeError

timedelta64, datetime64, and datetime64tz mostly behave like the numpy
dtypes, with a few exceptions:

    - shift raises on mismatch
    - fillna raises on mismatch for timedelta64, casts for the others

Categorical mostly behaves like other ExtensionDtypes, except for replace,
which has special logic.
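
The listings above reflect pandas at the time of writing, and the results may
differ on other versions.  A minimal sketch for tabulating the behavior on
whatever pandas is installed might look like the following.  The names
`observe` and `setitem` are hypothetical helpers, not pandas API, and no
expected output is shown because it is version-dependent.

```python
import warnings

import pandas as pd


def observe(make_series, value):
    """Map each operation to the resulting dtype, or to the exception name.

    `make_series` is a zero-argument factory so every operation starts from
    a fresh Series.  This helper is illustrative only, not part of pandas.
    """

    def setitem(s):
        s = s.copy()
        s.iloc[0] = value  # positional equivalent of ser[0] = value here
        return s

    ops = {
        "shift": lambda s: s.shift(1, fill_value=value),
        "mask": lambda s: s.mask([True, False, False], value),
        "where": lambda s: s.where([False, True, True], value),
        "fillna": lambda s: s.fillna(value),
        "replace": lambda s: s.replace(s.iloc[0], value),
        "setitem": setitem,
    }

    results = {}
    for name, op in ops.items():
        try:
            with warnings.catch_warnings():
                # Deprecation warnings are not the point of this survey.
                warnings.simplefilter("ignore")
                results[name] = str(op(make_series()).dtype)
        except Exception as err:
            results[name] = type(err).__name__
    return results


if __name__ == "__main__":
    factories = {
        "int64": lambda: pd.Series([1, 2, 3], dtype="i8"),
        "period": lambda: pd.Series(
            pd.period_range("2016-01-01", periods=3, freq="D")
        ),
        "Int64": lambda: pd.Series([pd.NA, 2, 3], dtype="Int64"),
    }
    for label, factory in factories.items():
        print(label, observe(factory, 1.5))
```

Passing "foo" instead of 1.5 as `value` exercises the object-dtype cases
discussed in this proposal.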

Goals
-----
- Have matching behavior across dtypes.
- Share code.

Options
-------
1) Change EA (and dt64/td64) behavior to match non-EA behavior
2) Change non-EA behavior to match EA behavior (or something stricter; xref
https://github.com/pandas-dev/pandas/issues/39584)
3) Deprecate (and eventually raise on) silent casting to _object_ dtype,
allowing silent casting otherwise.


Here I am advocating for option 3).  The advantages as I see them:

A) For numpy dtypes, we retain the most useful cases (int->float)
B) Deprecates cases most likely to be unintentional (e.g. typo "2016-01-01"
-> "2p16-01-01" causing a datetime64 Series to silently cast)
C) For td64/dt64/dt64tz/period, the *only* silent casting is to object, so
this completely gets rid of special-casing in that code
D) For IntegerArray, FloatingArray, and IntervalArray, this leaves open the
option of allowing e.g. Integer->Floating casting (xref
https://github.com/pandas-dev/pandas/issues/25288#issuecomment-941762174)
E) Does not preclude later deciding on the stricter options in 2)
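
To make the rule in option 3) concrete, here is a rough sketch in plain
Python.  It is a toy model, not pandas internals: dtypes are represented as
strings, and `common_dtype` and `cast_for_value` are hypothetical names with
a deliberately simplified promotion rule.  The only point is the shape of the
policy: casts that stay within non-object dtypes remain silent, while a cast
that would land on object emits a deprecation warning (and would eventually
raise).

```python
import warnings


def common_dtype(current, value):
    """Toy promotion rule: the dtype needed to hold `value` (assumption)."""
    if current == "object":
        return "object"
    if current in ("int64", "float64") and not isinstance(value, bool):
        if isinstance(value, int):
            return current  # ints fit in either numeric dtype
        if isinstance(value, float):
            return "float64"  # the useful int -> float case
    return "object"  # anything else only fits in object


def cast_for_value(current, value):
    """Apply option 3): warn only when the cast would end up at object."""
    target = common_dtype(current, value)
    if target == "object" and current != "object":
        warnings.warn(
            f"Silently casting {current} to object is deprecated; "
            "cast explicitly instead.",
            FutureWarning,
            stacklevel=2,
        )
    return target


if __name__ == "__main__":
    print(cast_for_value("int64", 1.5))  # silent upcast
    print(cast_for_value("int64", "foo"))  # emits a FutureWarning
```

Under this model the int->float case from A) stays silent, and the typo case
from B) surfaces as a warning instead of an unnoticed object Series.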