Count null/NaN values in a dataframe across columns

Question

I am trying to count the number of unique values in each row, across the columns of a dataframe.

This is the current dataframe:

[in] df
[out] 
         PID         CID      PPID        PPPID       PPPPID        PPPPPID
    0   2015-01-02   456      2014-01-02  2014-01-02  2014-01-02    2014-01-02
    1   2015-02-02   500      2014-02-02  2013-02-02  2012-02-02    2012-02-10  
    2   2010-12-04   300      2010-12-04  2010-12-04  2010-12-04    2010-12-04

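All columns except CID (contract_ID) are datetimes. I would like to add another column to the dataframe that counts the number of unique datetimes in each row (the goal is to find out how many contracts are in a "chain").

I have tried different implementations of the .count() and .sum() methods, but cannot get them to work row by row (the output is the same value for every row).

Example: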

df_merged['COUNT'] = df_merged2.count(axis=1)

This fills the entire 'COUNT' column with "6", whereas I want each row to have its own value.

Removing the axis=1 argument makes the entire column 'NaN'.

Answer 1

You need apply(your_func, axis=1) to work row by row.

df

Out[19]: 
          PID  CID        PPID       PPPID      PPPPID     PPPPPID
0  2015-01-02  456  2014-01-02  2014-01-02  2014-01-02  2014-01-02
1  2015-02-02  500  2014-02-02  2013-02-02  2012-02-02  2012-02-10
2  2010-12-04  300  2010-12-04  2010-12-04  2010-12-04  2010-12-04



df['counts'] = df.drop('CID', axis=1).apply(lambda row: len(pd.unique(row)), axis=1)

Out[20]: 
          PID  CID        PPID       PPPID      PPPPID     PPPPPID  counts
0  2015-01-02  456  2014-01-02  2014-01-02  2014-01-02  2014-01-02       2
1  2015-02-02  500  2014-02-02  2013-02-02  2012-02-02  2012-02-10       5
2  2010-12-04  300  2010-12-04  2010-12-04  2010-12-04  2010-12-04       1

[3 rows x 7 columns]
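
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The way df is built below is an assumption (the question only shows the printed frame), and the helper name date_cols is mine.

```python
import pandas as pd

# Assumed reconstruction of the frame shown in the question.
dates = {
    "PID":     ["2015-01-02", "2015-02-02", "2010-12-04"],
    "PPID":    ["2014-01-02", "2014-02-02", "2010-12-04"],
    "PPPID":   ["2014-01-02", "2013-02-02", "2010-12-04"],
    "PPPPID":  ["2014-01-02", "2012-02-02", "2010-12-04"],
    "PPPPPID": ["2014-01-02", "2012-02-10", "2010-12-04"],
}
df = pd.DataFrame({name: pd.to_datetime(vals) for name, vals in dates.items()})
df.insert(1, "CID", [456, 500, 300])  # the non-datetime contract ID column

# Every column except CID holds datetimes.
date_cols = [c for c in df.columns if c != "CID"]

# axis=1 hands each row to the lambda as a Series; count its distinct values.
df["counts"] = df[date_cols].apply(lambda row: len(pd.unique(row)), axis=1)
print(df["counts"].tolist())
```

Note that pd.unique treats a missing value (NaT) as a value of its own, so rows with gaps would be counted one higher than the number of actual dates.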

Answer 2

Another approach is to call unique on the transpose of the df:

In [26]:    
df['counts'] = df.drop('CID', axis=1).T.apply(lambda x: len(pd.Series.unique(x)))
df

Out[26]:
          PID  CID        PPID       PPPID      PPPPID     PPPPPID  counts
0  2015-01-02  456  2014-01-02  2014-01-02  2014-01-02  2014-01-02       2
1  2015-02-02  500  2014-02-02  2013-02-02  2012-02-02  2012-02-10       5
2  2010-12-04  300  2010-12-04  2010-12-04  2010-12-04  2010-12-04       1

Answer 3

You can use nunique directly on the DataFrame. This is available as of pd.__version__ == u'0.20.0'.

In [169]: df['counts'] = df.drop('CID', axis=1).nunique(axis=1)

In [170]: df
Out[170]:
          PID  CID        PPID       PPPID      PPPPID     PPPPPID  counts
0  2015-01-02  456  2014-01-02  2014-01-02  2014-01-02  2014-01-02       2
1  2015-02-02  500  2014-02-02  2013-02-02  2012-02-02  2012-02-10       5
2  2010-12-04  300  2010-12-04  2010-12-04  2010-12-04  2010-12-04       1
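
As a side note on why .count(axis=1) gave the same number for every row in the question: count tallies non-missing cells, not distinct ones. A small sketch of the difference, using a hypothetical two-row frame (columns A, B, C are made up for illustration) with one missing date:

```python
import pandas as pd

# Hypothetical frame; the second row has a missing date (NaT) in column B.
df = pd.DataFrame({
    "A": pd.to_datetime(["2014-01-02", "2014-02-02"]),
    "B": pd.to_datetime(["2014-01-02", None]),
    "C": pd.to_datetime(["2015-01-02", "2013-02-02"]),
})

# Non-missing cells per row (this is what the question's attempt computed).
non_null = df.count(axis=1)

# Distinct values per row; missing values are ignored by default.
distinct = df.nunique(axis=1)

# Distinct values per row, with NaT counted as a value of its own.
distinct_with_nat = df.nunique(axis=1, dropna=False)

print(non_null.tolist(), distinct.tolist(), distinct_with_nat.tolist())
```

The default dropna=True is probably what you want for counting contracts in a chain, since an empty slot is not a contract.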

python

datetime

nan

pandas