Python: grouping data and comparing rows on the basis of date
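I have a data set and want to check, for each row, whether the latest st_1 or st_2 is greater than the previous st_1 or st_2, and put True or False in another column accordingly. How can I do this on the basis of date and id?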
id date st_1 st_2
1 2022-02-28 00:00:00+00:00 60.0 6.0
2 2021-10-31 00:00:00+00:00 70.0 0.0
2 2021-12-31 00:00:00+00:00 70.0 4.0
3 2021-10-31 00:00:00+00:00 60.0 0.0
4 2021-06-30 00:00:00+00:00 63.3 2.66
4 2021-08-31 00:00:00+00:00 60.0 3.0
4 2022-02-28 00:00:00+00:00 70.0 2.0
5 2021-06-30 00:00:00+00:00 70.0 3.0
4 2022-02-28 00:00:00+00:00 70.0 2.0
5 2021-06-30 00:00:00+00:00 70.0 3.0
5 2021-08-31 00:00:00+00:00 80.0 2.0
5 2021-10-31 00:00:00+00:00 70.0 3.5
My expected result:
id date st_1 st_2 outcome
1 2022-02-28 00:00:00+00:00 60.0 6.0 false
2 2021-10-31 00:00:00+00:00 70.0 0.0 false
2 2021-12-31 00:00:00+00:00 70.0 4.0 true
3 2021-10-31 00:00:00+00:00 60.0 0.0 false
4 2021-06-30 00:00:00+00:00 63.3 2.66 false
4 2021-08-31 00:00:00+00:00 60.0 3.0 true
4 2022-02-28 00:00:00+00:00 70.0 2.0 true
5 2021-06-30 00:00:00+00:00 70.0 3.0 false
5 2021-08-31 00:00:00+00:00 80.0 2.0 true
5 2021-10-31 00:00:00+00:00 70.0 3.5 true
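Update #2: I fixed the sort so that it orders by id first and then by date, and added a lag_id column, which is now used to ensure that comparisons are made only within the same id.
Update: I just noticed that the spec is "if the latest st_1 or st_2 is greater than the previous st_1 or st_2", which means the correct answer uses "|" rather than the "&" of the original answer. Corrected.
Code: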
import io
import pandas as pd
string = """id date st_1 st_2
1 "2022-02-28 00:00:00+00:00" 60.0 6.0
2 "2021-10-31 00:00:00+00:00" 70.0 0.0
2 "2021-12-31 00:00:00+00:00" 70.0 4.0
3 "2021-10-31 00:00:00+00:00" 60.0 0.0
4 "2021-06-30 00:00:00+00:00" 63.3 2.66
4 "2021-08-31 00:00:00+00:00" 60.0 3.0
4 "2022-02-28 00:00:00+00:00" 70.0 2.0
5 "2021-06-30 00:00:00+00:00" 70.0 3.0
4 "2022-02-28 00:00:00+00:00" 70.0 2.0
5 "2021-06-30 00:00:00+00:00" 70.0 3.0
5 "2021-08-31 00:00:00+00:00" 80.0 2.0
5 "2021-10-31 00:00:00+00:00" 70.0 3.5
"""
data = io.StringIO(string)
df = pd.read_csv(data, sep=r"\s+") # Load df from the data string (raw string avoids an invalid-escape warning)
df.sort_values(['id', 'date'], inplace=True) # Sort by id, then by date, according to the spec
print(df)
df['lag_id'] = df['id'].shift(1) # Lag the id column
df['lag_st_1'] = df['st_1'].shift(1) # Create column lag_st_1 with the st_1 data lagged by 1 row
df['lag_st_2'] = df['st_2'].shift(1) # Ditto for st_2
print(df)
# Create result column with True values where the right conditions are met
df.loc[(df['id'] == df['lag_id'])
       & (
           (df['st_1'] > df['lag_st_1'])
           | (df['st_2'] > df['lag_st_2'])
       ), 'result'] = True
# The previous operation fills the rest of the rows with NAs.
# Here we change the NAs to "False"
df['result'] = df['result'].fillna(False)
print(df)
Updated output:
id date st_1 st_2
0 1 2022-02-28 00:00:00+00:00 60.0 6.00
1 2 2021-10-31 00:00:00+00:00 70.0 0.00
2 2 2021-12-31 00:00:00+00:00 70.0 4.00
3 3 2021-10-31 00:00:00+00:00 60.0 0.00
4 4 2021-06-30 00:00:00+00:00 63.3 2.66
5 4 2021-08-31 00:00:00+00:00 60.0 3.00
6 4 2022-02-28 00:00:00+00:00 70.0 2.00
8 4 2022-02-28 00:00:00+00:00 70.0 2.00
7 5 2021-06-30 00:00:00+00:00 70.0 3.00
9 5 2021-06-30 00:00:00+00:00 70.0 3.00
10 5 2021-08-31 00:00:00+00:00 80.0 2.00
11 5 2021-10-31 00:00:00+00:00 70.0 3.50
id date st_1 st_2 lag_id lag_st_1 lag_st_2
0 1 2022-02-28 00:00:00+00:00 60.0 6.00 NaN NaN NaN
1 2 2021-10-31 00:00:00+00:00 70.0 0.00 1.0 60.0 6.00
2 2 2021-12-31 00:00:00+00:00 70.0 4.00 2.0 70.0 0.00
3 3 2021-10-31 00:00:00+00:00 60.0 0.00 2.0 70.0 4.00
4 4 2021-06-30 00:00:00+00:00 63.3 2.66 3.0 60.0 0.00
5 4 2021-08-31 00:00:00+00:00 60.0 3.00 4.0 63.3 2.66
6 4 2022-02-28 00:00:00+00:00 70.0 2.00 4.0 60.0 3.00
8 4 2022-02-28 00:00:00+00:00 70.0 2.00 4.0 70.0 2.00
7 5 2021-06-30 00:00:00+00:00 70.0 3.00 4.0 70.0 2.00
9 5 2021-06-30 00:00:00+00:00 70.0 3.00 5.0 70.0 3.00
10 5 2021-08-31 00:00:00+00:00 80.0 2.00 5.0 70.0 3.00
11 5 2021-10-31 00:00:00+00:00 70.0 3.50 5.0 80.0 2.00
id date st_1 st_2 lag_id lag_st_1 lag_st_2 result
0 1 2022-02-28 00:00:00+00:00 60.0 6.00 NaN NaN NaN False
1 2 2021-10-31 00:00:00+00:00 70.0 0.00 1.0 60.0 6.00 False
2 2 2021-12-31 00:00:00+00:00 70.0 4.00 2.0 70.0 0.00 True
3 3 2021-10-31 00:00:00+00:00 60.0 0.00 2.0 70.0 4.00 False
4 4 2021-06-30 00:00:00+00:00 63.3 2.66 3.0 60.0 0.00 False
5 4 2021-08-31 00:00:00+00:00 60.0 3.00 4.0 63.3 2.66 True
6 4 2022-02-28 00:00:00+00:00 70.0 2.00 4.0 60.0 3.00 True
8 4 2022-02-28 00:00:00+00:00 70.0 2.00 4.0 70.0 2.00 False
7 5 2021-06-30 00:00:00+00:00 70.0 3.00 4.0 70.0 2.00 False
9 5 2021-06-30 00:00:00+00:00 70.0 3.00 5.0 70.0 3.00 False
10 5 2021-08-31 00:00:00+00:00 80.0 2.00 5.0 70.0 3.00 True
11 5 2021-10-31 00:00:00+00:00 70.0 3.50 5.0 80.0 2.00 True
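The lag columns above can also be expressed as a per-id difference. The following is a minimal sketch rather than part of the original answer. It assumes that exact duplicate rows should be dropped (the expected result in the question lists 10 rows, while the input has 12), that the date strings can be parsed with `pd.to_datetime`, and that the result column is called `outcome` as in the question.

```python
import io
import pandas as pd

raw = """id date st_1 st_2
1 "2022-02-28 00:00:00+00:00" 60.0 6.0
2 "2021-10-31 00:00:00+00:00" 70.0 0.0
2 "2021-12-31 00:00:00+00:00" 70.0 4.0
3 "2021-10-31 00:00:00+00:00" 60.0 0.0
4 "2021-06-30 00:00:00+00:00" 63.3 2.66
4 "2021-08-31 00:00:00+00:00" 60.0 3.0
4 "2022-02-28 00:00:00+00:00" 70.0 2.0
5 "2021-06-30 00:00:00+00:00" 70.0 3.0
4 "2022-02-28 00:00:00+00:00" 70.0 2.0
5 "2021-06-30 00:00:00+00:00" 70.0 3.0
5 "2021-08-31 00:00:00+00:00" 80.0 2.0
5 "2021-10-31 00:00:00+00:00" 70.0 3.5
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
df["date"] = pd.to_datetime(df["date"])

# Assumption: exact duplicate rows are noise and can be dropped
df = (
    df.drop_duplicates()
      .sort_values(["id", "date"])
      .reset_index(drop=True)
)

# Difference to the previous row within the same id (NaN for the first row of each id)
diffs = df.groupby("id")[["st_1", "st_2"]].diff()

# True where st_1 or st_2 increased; NaN > 0 evaluates to False
df["outcome"] = diffs.gt(0).any(axis=1)
print(df)
```

Because `diff()` is computed inside each `id` group, no `lag_id` column is needed to stop comparisons from crossing into another id.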
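IIUC, you want to check the condition by date, but no date is tested against the previous date. The logic in C Pappy's answer is better than this one; this version only checks within each date group, so it yields fewer 'True' results. Please let us know which is correct.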
import numpy as np
df.sort_values(['date', 'id'], inplace=True)
df['st_1_check'] = False
df['st_2_check'] = False
def test_conditions(x):
    # Only groups with more than one row have a previous row to compare against
    if x.shape[0] > 1:
        x.loc[:, 'st_1_check'] = x['st_1'] - x['st_1'].shift(1)
        x.loc[:, 'st_2_check'] = x['st_2'] - x['st_2'].shift(1)
    return x
dfnew = df.groupby(['date']).apply(test_conditions)
dfnew.fillna(False, inplace=True)
# Positive differences become True, everything else becomes False
dfnew['st_1_check'] = np.where(dfnew['st_1_check'] > 0, True, dfnew['st_1_check'])
dfnew['st_2_check'] = np.where(dfnew['st_2_check'] > 0, True, dfnew['st_2_check'])
dfnew['st_1_check'] = np.where(dfnew['st_1_check'] <= 0, False, dfnew['st_1_check'])
dfnew['st_2_check'] = np.where(dfnew['st_2_check'] <= 0, False, dfnew['st_2_check'])
dfnew
id date st_1 st_2 st_1_check st_2_check
4 4 2021-06-30 00:00:00+00:00 63.30000 2.66000 False False
7 5 2021-06-30 00:00:00+00:00 70.00000 3.00000 True True
9 5 2021-06-30 00:00:00+00:00 70.00000 3.00000 False False
5 4 2021-08-31 00:00:00+00:00 60.00000 3.00000 False False
10 5 2021-08-31 00:00:00+00:00 80.00000 2.00000 True False
1 2 2021-10-31 00:00:00+00:00 70.00000 0.00000 False False
3 3 2021-10-31 00:00:00+00:00 60.00000 0.00000 False False
11 5 2021-10-31 00:00:00+00:00 70.00000 3.50000 True True
2 2 2021-12-31 00:00:00+00:00 70.00000 4.00000 False False
0 1 2022-02-28 00:00:00+00:00 60.00000 6.00000 False False
6 4 2022-02-28 00:00:00+00:00 70.00000 2.00000 True False
8 4 2022-02-28 00:00:00+00:00 70.00000 2.00000 False False
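The same per-date check can be written without `apply`. This is a sketch under the assumption that the intent is "difference to the previous row within the same date, ordered by id"; the variable name `checks` is introduced here for illustration only, and the duplicate rows are kept as in the answer above.

```python
import io
import pandas as pd

raw = """id date st_1 st_2
1 "2022-02-28 00:00:00+00:00" 60.0 6.0
2 "2021-10-31 00:00:00+00:00" 70.0 0.0
2 "2021-12-31 00:00:00+00:00" 70.0 4.0
3 "2021-10-31 00:00:00+00:00" 60.0 0.0
4 "2021-06-30 00:00:00+00:00" 63.3 2.66
4 "2021-08-31 00:00:00+00:00" 60.0 3.0
4 "2022-02-28 00:00:00+00:00" 70.0 2.0
5 "2021-06-30 00:00:00+00:00" 70.0 3.0
4 "2022-02-28 00:00:00+00:00" 70.0 2.0
5 "2021-06-30 00:00:00+00:00" 70.0 3.0
5 "2021-08-31 00:00:00+00:00" 80.0 2.0
5 "2021-10-31 00:00:00+00:00" 70.0 3.5
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
df = df.sort_values(["date", "id"]).reset_index(drop=True)

# Difference to the previous row within the same date group;
# the first row of each group gets NaN, and NaN > 0 is False
checks = df.groupby("date")[["st_1", "st_2"]].diff().gt(0)

df["st_1_check"] = checks["st_1"]
df["st_2_check"] = checks["st_2"]
print(df)
```

This compares different ids that share a date, which is a different question from comparing one id over time, so it is only appropriate if that is what was asked.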
Try this:
# Note: this relies on df having the default integer index 0..n-1, in row order
def func_date(x):
    if x==0:
        return 0
    elif df.at[x-1,'st_1']>df.at[x,'st_1']:
        return 'higher'
    elif df.at[x-1,'st_1']==df.at[x,'st_1']:
        return '='
    else:
        return 'less'
df['result']=df.index.map(func_date)
print(df)
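A row-by-row lookup like the one above can also be done in a vectorized way with `shift` and `np.select`. The sketch below is an illustration only: the `demo` frame and the labels `increased`, `unchanged`, `decreased` and `first` are made up for this example, and the comparison is against the previous row in the current row order, not grouped by id.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data, not the data set from the question
demo = pd.DataFrame({"st_1": [60.0, 70.0, 70.0, 60.0]})

prev = demo["st_1"].shift(1)  # value of the previous row, NaN for the first row

demo["result"] = np.select(
    [demo["st_1"] > prev, demo["st_1"] == prev, demo["st_1"] < prev],
    ["increased", "unchanged", "decreased"],
    default="first",  # comparisons with NaN are all False, so the first row lands here
)
print(demo)
```

Since this does not look rows up by index label, it works regardless of what the index looks like after sorting.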