Pandas Dataframe - record number of rows based on cumulative sum on a column with a condition
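In the df below I already have column "A". I am trying to add another column, "Desired", whose value is the number of rows below the corresponding A value that are needed for the cumulative sum of those A values to first reach >= 8.
For example, row 1 of "Desired" would be 3, because 5+2+3 >= 8. Row 2 of "Desired" would be 4, because 2+3+2+2 >= 8.
So the ideal new df looks like this.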
df:

A | Desired |
---|---|
8 | 3 |
5 | 4 |
2 | 4 |
3 | 4 |
2 | 3 |
2 | 2 |
1 | 1 |
11 | 1 |
8 | NA |
6 | NA |
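To make the rule concrete, here is a minimal brute-force sketch of the definition as I read it: for each row, keep adding the values below it until the running total reaches 8. The function name `rows_until_threshold` and its `threshold` parameter are made up for illustration and are not part of the question.

```python
def rows_until_threshold(values, threshold=8):
    """For each position, count how many of the following values must be
    summed before the running total first reaches `threshold`.
    Positions where the total never gets there yield None."""
    result = []
    for i in range(len(values)):
        total = 0
        count = None
        # Walk through the rows below row i, one at a time
        for k, v in enumerate(values[i + 1:], start=1):
            total += v
            if total >= threshold:
                count = k
                break
        result.append(count)
    return result


a = [8, 5, 2, 3, 2, 2, 1, 11, 8, 6]
print(rows_until_threshold(a))
# [3, 4, 4, 4, 3, 2, 1, 1, None, None]
```

This is quadratic in the worst case, so it is only meant as a reference to check faster versions against.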
Using cumsum() and a for loop:
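import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [8, 5, 2, 3, 2, 2, 1, 11, 8, 6]})
cumsum_arr = df['A'].cumsum().values
desired = np.zeros(len(df))
for i in range(len(df)):
    # Sum of the rows below row i; argmax gives the first offset where it reaches 8
    desired[i] = np.argmax((cumsum_arr[i:] - cumsum_arr[i]) >= 8)
df['Desired'] = desired
# argmax returns 0 when nothing matches, so 0 means the threshold was never reached
df['Desired'] = df['Desired'].replace(0, np.nan)

Output:
    A  Desired
0   8      3.0
1   5      4.0
2   2      4.0
3   3      4.0
4   2      3.0
5   2      2.0
6   1      1.0
7  11      1.0
8   8      NaN
9   6      NaN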
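If every value in A is non-negative (an assumption; it holds for the sample data, but the question does not state it), the cumulative sum is sorted, and the same result can probably be obtained without the Python-level loop by using `np.searchsorted`. This is a different technique from the answer above, sketched here only as an alternative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [8, 5, 2, 3, 2, 2, 1, 11, 8, 6]})

# Assumption: A has no negative values, so the cumulative sum is non-decreasing
csum = df['A'].cumsum().to_numpy()

# For row i, find the first position j with csum[j] >= csum[i] + 8
first_hit = np.searchsorted(csum, csum + 8, side='left')

# j - i is the number of rows below row i that were needed;
# j == len(df) means the threshold is never reached
steps = (first_hit - np.arange(len(df))).astype(float)
steps[first_hit == len(df)] = np.nan

df['Desired'] = steps
print(df)
```

With negative values in A the cumulative sum is no longer sorted and `searchsorted` would give wrong answers, so the loop version is the safer choice in that case.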
Using a rolling() window, this can be done without an explicit loop:
import io

import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""|A|Desired|
|8 |3 |
|5 |4 |
|2 |4 |
|3 |4 |
|2 |3 |
|2 |2 |
|1 |1 |
|11 |1 |
|8 |NA |
|6 |NA |"""),sep="|")
# the leading and trailing "|" produce empty "Unnamed" columns; drop them
df = df.drop(columns=[c for c in df.columns if "Unnamed" in c])
df["Desired"] = pd.to_numeric(df["Desired"], errors="coerce").astype("Int64")
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html see example
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=len(df))
df["DesiredCalc"] = (df["A"]
    # looking at rows after current row
    .shift(-1)
    .rolling(indexer, min_periods=1)
    # if any result of cumsum()>=8 then return zero based index + 1, else no result
    .apply(lambda x: np.where(np.cumsum(x).ge(8).any(), np.argmax(np.cumsum(x).ge(8)) + 1, np.nan))
    .astype("Int64")
)
Output:
A Desired DesiredCalc
8 3 3
5 4 4
2 4 4
3 4 4
2 3 3
2 2 2
1 1 1
11 1 1
8 <NA> <NA>
6 <NA> <NA>
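For readers who have not used `FixedForwardWindowIndexer` before, here is a small sketch of what a forward-looking window does on its own. The series values and the window size of 2 are chosen only for illustration, and a pandas version that ships `pd.api.indexers.FixedForwardWindowIndexer` is assumed.

```python
import pandas as pd

s = pd.Series([8, 5, 2, 3])

# A window of size 2 that starts at the current row and extends forward,
# instead of looking backward like the default rolling window
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)

# min_periods=1 lets the last row produce a value from its single element
forward_sum = s.rolling(indexer, min_periods=1).sum()
print(forward_sum.tolist())
# [13.0, 7.0, 5.0, 3.0]
```

In the answer above the window size is `len(df)`, so each window covers every remaining row, and the lambda then looks for the first position where the cumulative sum reaches 8.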