Pandas counting and summing specific conditions returns only NaN

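I am trying to follow the otherwise excellent solution provided in the thread pandas-counting-and-summing-specific-conditions, but the code only ever outputs nan values and, with sum (not count), gives a FutureWarning.

Basically, for every row in my df, I want to count how many dates in a column fall within +/- 1 day of that row's date in the same column.

If I were doing this in Excel, either of the following multi-condition SUMPRODUCT or COUNTIFS formulas would work: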

= SUMPRODUCT(--(AN2>=$AN:$AN000-1),--(AN2<=$AN:$AN000+1)),

=countifs($AN:$AN000,">="&AN2-1,$AN:$AN000,"<="&AN2+1)

Trying this in Python with the approach from the linked thread, I believe the code would be:

import pandas as pd
import datetime

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
                             pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
                             pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})

df["caseIntensity"] = df[(df['datet'] <= df['datet'] + datetime.timedelta(days=1)) &\
                             (df['datet'] >= df['datet'] - datetime.timedelta(days=1))].sum()

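The output should be: 2, 2, 2, 3, 3, 2. Instead it is all NaNs!

Is it correct to assume that, because I am testing conditions, it doesn't matter whether I sum or count? If I sum, I get a FutureWarning about invalid columns (the columns are valid), which I don't understand. But mostly, my question is: why am I only getting nan?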

I think what you are trying to sum does not match the logic you are trying to apply.
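To make the mismatch concrete, here is a minimal sketch of what the original assignment does. It uses `count()` in place of `sum()`, which is my substitution: `sum()` on a datetime column warns or raises depending on the pandas version, while `count()` shows the same alignment problem everywhere. Both sides of each comparison are the same Series, so row i is only ever compared with itself, and the aggregate comes back indexed by column name rather than by row.

```python
import datetime
import pandas as pd

df = pd.DataFrame({"datet": pd.to_datetime(
    ["2020-03-04", "2020-03-05", "2020-03-09",
     "2020-03-10", "2020-03-11", "2020-03-12"])})

one_day = datetime.timedelta(days=1)

# Element-wise comparison of a Series with itself shifted by one day:
# every date is within +/- 1 day of itself, so the mask is all True.
mask = (df["datet"] <= df["datet"] + one_day) & (df["datet"] >= df["datet"] - one_day)

# Aggregating the filtered frame returns a Series indexed by column name.
aggregated = df[mask].count()
print(aggregated)

# Column assignment aligns on index labels. The label "datet" matches
# none of the row labels 0..5, so every row is filled with NaN.
df["caseIntensity"] = aggregated
print(df)
```

So the condition never compares a row against the other rows, and the result has one entry per column, not one per row.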

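Use the following code:

Create a function that counts the dates within that range. Call that function for each row and save the result as the new column's value.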

import pandas as pd
import datetime

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
                             pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
                             pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})

def get_dates_in_range(df_copy, row):
    return df_copy[(df_copy['datet'] <= row['datet'] + datetime.timedelta(days=1)) &\
                     (df_copy['datet'] >= row['datet'] - datetime.timedelta(days=1))].shape[0]
    
    
df["caseIntensity"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)


       datet  caseIntensity
0 2020-03-04              2
1 2020-03-05              2
2 2020-03-09              2
3 2020-03-10              3
4 2020-03-11              3
5 2020-03-12              2

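Because apply loops over the rows, a vectorized solution is possible: first create NumPy arrays, compare them with broadcasting and chain the conditions with &, then count the Trues with sum: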

a = df['datet']
b = a + pd.Timedelta(days=1)
c = a - pd.Timedelta(days=1)
    
mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])

df["caseIntensity"]  = mask.sum(axis=1)
print (df)
       datet  caseIntensity
0 2020-03-04              2
1 2020-03-05              2
2 2020-03-09              2
3 2020-03-10              3
4 2020-03-11              3
5 2020-03-12              2
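To see what the `[:, None]` indexing does, here is a small sketch on just three of the dates. The names `upper`, `lower` and `counts` are mine, chosen for illustration. Comparing an array of shape (3,) against one of shape (3, 1) broadcasts to a (3, 3) matrix, where entry [i, j] says whether date j lies within +/- 1 day of date i.

```python
import numpy as np
import pandas as pd

a = pd.Series(pd.to_datetime(["2020-03-04", "2020-03-05", "2020-03-09"])).to_numpy()
one_day = np.timedelta64(1, "D")

# Adding a new axis turns the bounds into column vectors of shape (3, 1).
upper = (a + one_day)[:, None]
lower = (a - one_day)[:, None]

# Broadcasting (3,) against (3, 1) gives a (3, 3) boolean matrix:
# row i marks which dates fall inside [date_i - 1 day, date_i + 1 day].
mask = (a <= upper) & (a >= lower)
print(mask.astype(int))

# Summing across each row counts the matching dates for that row.
counts = mask.sum(axis=1)
print(counts)
```

Each row of the matrix is the per-row filter that the apply version computes one call at a time.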

Here is the performance on 6k rows:

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
                         pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
                         pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})
df = pd.concat([df] * 1000, ignore_index=True)


In [140]: %%timeit
     ...: a = df['datet']
     ...: b = a + pd.Timedelta(days=1)
     ...: c = a - pd.Timedelta(days=1)
     ...:     
     ...: mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])
     ...: 
     ...: df["caseIntensity"]  = mask.sum(axis=1)
     ...: 
469 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [141]: %%timeit
     ...: df["caseIntensity1"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)
     ...: 
     ...: 
6.2 s ± 368 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
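One caveat with the broadcast mask is memory: it holds n × n booleans, which is 36 million entries for 6k rows. If that becomes a problem, a possible alternative is to sort the dates once and locate each row's window with `numpy.searchsorted`. This is a different technique from the ones above and is not timed here, so treat it as a sketch; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"datet": pd.to_datetime(
    ["2020-03-04", "2020-03-05", "2020-03-09",
     "2020-03-10", "2020-03-11", "2020-03-12"])})

values = df["datet"].to_numpy()
sorted_values = np.sort(values)
one_day = np.timedelta64(1, "D")

# Position of the first sorted date >= (date - 1 day).
left = np.searchsorted(sorted_values, values - one_day, side="left")
# Position just past the last sorted date <= (date + 1 day).
right = np.searchsorted(sorted_values, values + one_day, side="right")

# The difference is the number of dates in the closed +/- 1 day window.
df["caseIntensity"] = right - left
print(df)
```

This assumes the column has no missing values (NaT); it needs only a few arrays of length n instead of an n × n matrix.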