Pandas counting and summing specific conditions returns only NaN
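I am trying to follow the otherwise excellent solution provided in the thread pandas-counting-and-summing-specific-conditions, but the code only ever outputs NaN values, and using sum (rather than count) gives a FutureWarning.
Basically, for every row in my df, I want to count how many dates in a column are within +/- 1 day of the other dates in the same column.
If I were doing this in Excel, either of the following multi-condition SUMPRODUCT or COUNTIFS formulas would work:
= SUMPRODUCT(--(AN2>=$AN:$AN000-1),--(AN2<=$AN:$AN000+1)),
or
=countifs($AN:$AN000,">="&AN2-1,$AN:$AN000,"<="&AN2+1)
Trying the approach from the linked thread in Python, I believe the code would be: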
import pandas as pd
import datetime
df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})
df["caseIntensity"] = df[(df['datet'] <= df['datet'] + datetime.timedelta(days=1)) &\
(df['datet'] >= df['datet'] - datetime.timedelta(days=1))].sum()
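The output should be: 2, 2, 2, 3, 3, 2.
Instead I get all NaN!
Is it correct to assume that, because I am testing conditions, it does not matter whether I sum or count? If I sum, I get a FutureWarning about invalid columns (the columns are valid), which I do not understand. Mostly, though, my question is: why am I only getting NaN?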
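I think what you are trying to sum does not match the logic you are trying to apply.
Use the following code, which:
creates a function that counts the dates within the range
calls that function for each row and saves the result as the value of the new column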
import pandas as pd
import datetime
df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})
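def get_dates_in_range(df_copy, row):
    # Count the rows whose date lies within +/- 1 day of this row's date
    return df_copy[(df_copy['datet'] <= row['datet'] + datetime.timedelta(days=1)) &
                   (df_copy['datet'] >= row['datet'] - datetime.timedelta(days=1))].shape[0]

# Call the function once per row and store the count in a new column
df["caseIntensity"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)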
       datet  caseIntensity
0 2020-03-04              2
1 2020-03-05              2
2 2020-03-09              2
3 2020-03-10              3
4 2020-03-11              3
5 2020-03-12              2
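As for why the attempt in the question produces only NaN, the sketch below reproduces what appears to be the cause. It is an illustration under assumptions rather than a full diagnosis: it uses count() in place of sum(), because summing a datetime column is what triggers the warning, and the names one_day and per_column are introduced here for clarity. The mask compares each row only with itself, so it is True everywhere, and the reduction returns one value per column (indexed by column name), which cannot align with the row index 0..5 on assignment.

```python
import pandas as pd

df = pd.DataFrame({"datet": pd.to_datetime([
    "2020-03-04", "2020-03-05", "2020-03-09",
    "2020-03-10", "2020-03-11", "2020-03-12",
])})
one_day = pd.Timedelta(days=1)

# Each row is compared only with itself, so every entry of the mask is True
mask = (df["datet"] <= df["datet"] + one_day) & (df["datet"] >= df["datet"] - one_day)

# Reducing the filtered frame gives one value per column, indexed by column name
per_column = df[mask].count()

# Assignment aligns on index labels; 'datet' matches none of 0..5, so every row is NaN
df["caseIntensity"] = per_column
print(df)
```

So the choice between sum and count is not the issue: either way the result is a single value per column, not one value per row.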
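The loop in apply can be replaced with a vectorized solution: first create numpy arrays, compare them with broadcasting, chain the two comparisons with &, and then count the True values with sum: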
a = df['datet']
b = a + pd.Timedelta(days=1)
c = a - pd.Timedelta(days=1)
mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])
df["caseIntensity"] = mask.sum(axis=1)
print (df)
       datet  caseIntensity
0 2020-03-04              2
1 2020-03-05              2
2 2020-03-09              2
3 2020-03-10              3
4 2020-03-11              3
5 2020-03-12              2
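One way to read the broadcast mask above: entry [i, j] is True when date j lies within one day of date i. The following minimal sketch expresses the same test as an absolute pairwise difference. It is an equivalent reformulation for illustration, not the code from the answer, and the names dates, diff and within are chosen here.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime([
    "2020-03-04", "2020-03-05", "2020-03-09",
    "2020-03-10", "2020-03-11", "2020-03-12",
]).to_numpy()

# Pairwise differences via broadcasting: diff[i, j] = dates[i] - dates[j]
diff = dates[:, None] - dates[None, :]

# True where the two dates are at most one day apart (includes the diagonal)
within = np.abs(diff) <= np.timedelta64(1, "D")

# Row sums give, for each date, how many dates fall in its +/- 1 day window
counts = within.sum(axis=1)
print(within.shape, counts)
```

Note that the intermediate array has n x n entries, so memory use grows quadratically with the number of rows.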
Here is the performance for 6k rows:
df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})
df = pd.concat([df] * 1000, ignore_index=True)
In [140]: %%timeit
...: a = df['datet']
...: b = a + pd.Timedelta(days=1)
...: c = a - pd.Timedelta(days=1)
...:
...: mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])
...:
...: df["caseIntensity"] = mask.sum(axis=1)
...:
469 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [141]: %%timeit
...: df["caseIntensity1"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)
...:
...:
6.2 s ± 368 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
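Both timed approaches compare every row with every other row. If that becomes a bottleneck, a different technique is to sort the dates once and locate each window's boundaries with binary search. The sketch below uses np.searchsorted for this; it is not taken from the answers above, the function name count_within_one_day is hypothetical, and it assumes the column contains no NaT values.

```python
import numpy as np
import pandas as pd

def count_within_one_day(dates: pd.Series) -> np.ndarray:
    """Count, for each date, the dates within +/- 1 day (itself included)."""
    values = dates.to_numpy()
    sorted_values = np.sort(values)
    one_day = np.timedelta64(1, "D")
    # Position of the first sorted date >= date - 1 day
    lo = np.searchsorted(sorted_values, values - one_day, side="left")
    # Position just past the last sorted date <= date + 1 day
    hi = np.searchsorted(sorted_values, values + one_day, side="right")
    return hi - lo

df = pd.DataFrame({"datet": pd.to_datetime([
    "2020-03-04", "2020-03-05", "2020-03-09",
    "2020-03-10", "2020-03-11", "2020-03-12",
])})
df["caseIntensity"] = count_within_one_day(df["datet"])
print(df)
```

This avoids building the n x n mask, so memory stays linear in the number of rows; no timing is claimed here since it was not measured against the runs above.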