Odd behaviour of pandas DataFrame.sum when column contains string value

Question

I have 3 pandas DataFrames of survey responses that look exactly the same but were created in different ways:

import pandas as pd

df1 = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])

df2 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df2.loc[1,2] = 'hey'

df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        if (i,j) != (1,2):
            df3.loc[i,j] = i*3 + j + 1
        else:
            df3.loc[i,j] = 'hey'

# df1, df2, df3 look the same as below
   0  1    2
0  1  2    3
1  4  5  hey
2  7  8    9

Now, when I sum along the columns, all three give me the same result.

sumcol1 = df1.sum()
sumcol2 = df2.sum()
sumcol3 = df3.sum()

# sumcol1, sumcol2, sumcol3 look the same as below
0    12
1    15
dtype: int64

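However, when I sum along the rows, df3 gives a different result from df1 and df2.

Furthermore, it seems that with axis=0 the sum of a column containing a string is not computed at all, whereas with axis=1 every row sum is computed, with the elements belonging to the string-containing column skipped.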

sumrow1 = df1.sum(axis=1)
sumrow2 = df2.sum(axis=1)
sumrow3 = df3.sum(axis=1)

#sumrow1
0     3
1     9
2    15
dtype: int64

#sumrow2
0     3
1     9
2    15
dtype: int64

#sumrow3
0    0.0
1    0.0
2    0.0
dtype: float64

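I have 3 questions about this.

What causes the different behaviour between sumcol1 and sumrow1?
What causes the different behaviour between sumrow1 and sumrow3?
Is there a proper way to obtain the same result as sumrow1 from df3?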

Added:

Is there a clever way to add up only the numerical values while keeping the strings?

My current workaround (thanks to jpp's kind answer):

df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])
df_c = df.copy()
for col in df.select_dtypes(['object']).columns:
    df_c[col] = pd.to_numeric(df_c[col], errors='coerce')
df['sum'] = df_c.sum(axis=1)

#result
   0  1    2   sum
0  1  2    3   6.0
1  4  5  hey   9.0
2  7  8    9  24.0

I am using Python 3.6.6 and pandas 0.23.4.

Answer 1

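There are a couple of issues:

The main issue is that the way df3 is constructed leaves all three series with dtype object, while df1 and df2 have dtype=int for the first two series.
Data in a Pandas dataframe is organised and stored by series [column]. Type casting is therefore performed per series. As a result, the logic for summing across "rows and columns" is necessarily different, and it is not necessarily consistent when types are mixed.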
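To make the second point concrete, here is a minimal sketch (not part of the original answer) that compares the dtype of one column of df1 with the dtype of one row. The variable names col_dtype and row_dtype are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 'hey'], [7, 8, 9]])

# A column is stored as a single Series with its own dtype.
col_dtype = df1[0].dtype        # integer dtype: the column holds only integers

# A row cuts across columns of different dtypes, so pandas falls back to a
# common dtype (object) to represent it as one Series.
row_dtype = df1.iloc[0].dtype   # object, even though row 0 contains no string

print(col_dtype, row_dtype)
```

This is why a column-wise operation can work with clean numeric columns, while a row-wise operation has to deal with mixed values.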

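To understand what is happening with the first issue, you have to appreciate that Pandas does not continually check that the most appropriate dtype is selected after each operation. Doing so would be prohibitively expensive.

You can check the dtypes for yourself: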

print({'df1': df1.dtypes, 'df2': df2.dtypes, 'df3': df3.dtypes})

{'df1': 0     int64
1     int64
2    object
dtype: object, 'df2': 0     int64
1     int64
2    object
dtype: object, 'df3': 0    object
1    object
2    object
dtype: object}

You can apply the conversion to df3 selectively, by checking whether the conversion produces any null values:

for col in df3.select_dtypes(['object']).columns:
    col_num = pd.to_numeric(df3[col], errors='coerce')
    if not col_num.isnull().any():  # check if any null values
        df3[col] = col_num  # assign numeric series

print(df3.dtypes)

0     int64
1     int64
2    object
dtype: object

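You should then see consistent treatment. At this point, it is worth discarding your original df3: it is not documented anywhere that continual series type-checking can or should be applied after each operation.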
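As an alternative to the loop above, the following sketch uses DataFrame.infer_objects(), which is not mentioned in the original answer. It assumes a pandas version that provides this method (0.21 or later), and the names before, inferred, after and row_sum are only illustrative.

```python
import pandas as pd

# Rebuild df3 as in the question: every column starts out with dtype object.
df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        df3.loc[i, j] = 'hey' if (i, j) == (1, 2) else i * 3 + j + 1

before = df3.dtypes              # object for all three columns

# infer_objects() returns a new frame; df3 itself is not modified.
inferred = df3.infer_objects()
after = inferred.dtypes          # columns 0 and 1 become integer, column 2 stays object

# Summing only the numeric columns avoids relying on how sum() treats strings.
row_sum = inferred.select_dtypes('number').sum(axis=1)

print(before, after, row_sum, sep='\n\n')
```

Selecting the numeric columns explicitly before summing keeps the result independent of how a given pandas version handles object columns in sum().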

To ignore non-numeric values when summing across rows or columns, you can force conversion via pd.to_numeric with errors='coerce':
df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])

col_sum = df.apply(pd.to_numeric, errors='coerce').sum()
row_sum = df.apply(pd.to_numeric, errors='coerce').sum(1)

print(col_sum)

0    12.0
1    15.0
2    12.0
dtype: float64

print(row_sum)

0     6.0
1     9.0
2    24.0
dtype: float64
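To tie this back to the follow-up in the question (keeping the strings while summing only the numbers), here is a small sketch that combines the coercion above with a new column. The intermediate name numeric is only illustrative, and this is one possible arrangement rather than the only one.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 'hey'], [7, 8, 9]])

# Coerce a temporary copy to numbers ('hey' becomes NaN); df is left untouched.
numeric = df.apply(pd.to_numeric, errors='coerce')

# sum() skips NaN by default, so the string cell contributes nothing to its row.
df['sum'] = numeric.sum(axis=1)

print(df)
```

Because the coercion happens on a separate frame, the original 'hey' cell is still present in df next to the new sum column.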

Answer 2

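Going by your question and jpp's diagnosis, the dataframes look the same but differ in dtype: df3 holds its first two columns as object, whereas df1 and df2 hold them as int64 (the third column is object in all three).

Here are some ways of comparing them that reveal the difference: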

>>> df1.equals(df3)
False # not so useful, doesn't tell you why they differ

What you really want is pandas.testing.assert_frame_equal:

>>> import pandas.testing
>>> pandas.testing.assert_frame_equal(df1, df3)

AssertionError: Attributes are different

Attribute "dtype" are different
[left]:  int64
[right]: object

pandas.testing.assert_frame_equal() comes with a kitchen sink of useful arguments, which you can customize as needed:

check_dtype : bool, default True    
Whether to check the DataFrame dtype is identical.

check_index_type : bool / string {‘equiv’}, default False    
Whether to check the Index class, dtype and inferred_type are identical.

check_column_type : bool / string {‘equiv’}, default False    
Whether to check the columns class, dtype and inferred_type are identical.

check_frame_type : bool, default False    
Whether to check the DataFrame class is identical.

check_less_precise : bool or int, default False    
Specify comparison precision. Only used when check_exact is False. 5 digits (False) or 3 digits (True) after decimal points are compared. If int, then specify the digits to compare

check_names : bool, default True    
Whether to check the Index names attribute.

by_blocks : bool, default False    
Specify how to compare internal data. If False, compare by columns. If True, compare by blocks.

check_exact : bool, default False    
Whether to compare number exactly.

check_datetimelike_compat : bool, default False    
Compare datetime-like which is comparable ignoring dtype.

check_categorical : bool, default True    
Whether to compare internal Categorical exactly.

check_like : bool, default False    
If true, ignore the order of rows & columns
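As an illustration of one of these arguments, the sketch below compares two frames that hold the same values under different dtypes, first with the default check_dtype=True and then with check_dtype=False. The frames left and right and the flags strict_equal and loose_equal are made up for this example.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({'a': [1, 2, 3]})   # integer column
right = left.astype(object)             # same values stored as object

try:
    assert_frame_equal(left, right)     # check_dtype=True by default
    strict_equal = True
except AssertionError:
    strict_equal = False                # dtypes differ, so the strict check fails

# With check_dtype=False only the values and labels are compared.
assert_frame_equal(left, right, check_dtype=False)
loose_equal = True

print(strict_equal, loose_equal)
```

The strict call is the one that explains why two frames differ; the relaxed call is useful when only the values matter.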
