Python: group data and compare rows on the basis of date

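I have a dataset in which, for each id, I want to check whether the latest st_1 or st_2 is greater than the previous st_1 or st_2, and put True or False in another column accordingly. How can I do this based on date and id?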

id  date                        st_1    st_2
1   2022-02-28 00:00:00+00:00   60.0    6.0
2   2021-10-31 00:00:00+00:00   70.0    0.0
2   2021-12-31 00:00:00+00:00   70.0    4.0
3   2021-10-31 00:00:00+00:00   60.0    0.0
4   2021-06-30 00:00:00+00:00   63.3    2.66
4   2021-08-31 00:00:00+00:00   60.0    3.0
4   2022-02-28 00:00:00+00:00   70.0    2.0
5   2021-06-30 00:00:00+00:00   70.0    3.0
4   2022-02-28 00:00:00+00:00   70.0    2.0
5   2021-06-30 00:00:00+00:00   70.0    3.0
5   2021-08-31 00:00:00+00:00   80.0    2.0
5   2021-10-31 00:00:00+00:00   70.0    3.5

My expected result:

id  date                        st_1    st_2  outcome
1   2022-02-28 00:00:00+00:00   60.0    6.0   false
2   2021-10-31 00:00:00+00:00   70.0    0.0   false
2   2021-12-31 00:00:00+00:00   70.0    4.0   true
3   2021-10-31 00:00:00+00:00   60.0    0.0   false
4   2021-06-30 00:00:00+00:00   63.3    2.66  false
4   2021-08-31 00:00:00+00:00   60.0    3.0   true
4   2022-02-28 00:00:00+00:00   70.0    2.0   true
5   2021-06-30 00:00:00+00:00   70.0    3.0   false 
5   2021-08-31 00:00:00+00:00   80.0    2.0   true
5   2021-10-31 00:00:00+00:00   70.0    3.5   true

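Update #2: I fixed the sorting so that it sorts by id first and then by date, and added the column lag_id, which is now used to make sure comparisons happen only within the same id.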

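Update: I just noticed that the spec is "if the latest st_1 or st_2 is greater than the previous st_1 or st_2", which means the correct answer uses "|" rather than the "&" of the original answer. Corrected.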
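To make the difference between the two operators concrete, here is a minimal sketch using one pair of rows from the question (id 4, 2021-06-30 followed by 2021-08-31), where st_1 went down but st_2 went up. The variable names `cur`, `prev`, `either` and `both` are only illustrative.

```python
import pandas as pd

# Current row (2021-08-31) and previous row (2021-06-30) for id 4
cur = pd.DataFrame({"st_1": [60.0], "st_2": [3.0]})
prev = pd.DataFrame({"st_1": [63.3], "st_2": [2.66]})

# "|" is True when at least one column increased
either = (cur["st_1"] > prev["st_1"]) | (cur["st_2"] > prev["st_2"])
# "&" is True only when both columns increased
both = (cur["st_1"] > prev["st_1"]) & (cur["st_2"] > prev["st_2"])

print(either[0], both[0])
```

With "|" this row is flagged, which agrees with the expected result in the question; with "&" it would not be.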

Code:

import io
import pandas as pd
string = """id  date st_1 st_2
1  "2022-02-28 00:00:00+00:00"  60.0    6.0
2  "2021-10-31 00:00:00+00:00"  70.0    0.0
2  "2021-12-31 00:00:00+00:00"  70.0    4.0
3  "2021-10-31 00:00:00+00:00"  60.0    0.0
4  "2021-06-30 00:00:00+00:00"  63.3    2.66
4  "2021-08-31 00:00:00+00:00"  60.0    3.0
4  "2022-02-28 00:00:00+00:00"  70.0    2.0
5  "2021-06-30 00:00:00+00:00"  70.0    3.0
4  "2022-02-28 00:00:00+00:00"  70.0    2.0
5  "2021-06-30 00:00:00+00:00"  70.0    3.0
5  "2021-08-31 00:00:00+00:00"  80.0    2.0
5  "2021-10-31 00:00:00+00:00"  70.0    3.5
"""
data = io.StringIO(string)
df = pd.read_csv(data, sep=r"\s+")  # Load df from the data string (raw string avoids an invalid-escape warning)
df.sort_values(['id', 'date'], inplace=True)  # Sort according to the spec
print(df)

df['lag_id'] = df['id'].shift(1) # Lag the id column

df['lag_st_1'] = df['st_1'].shift(1)  # Create column lag_st_1 with the st_1 data lagged by 1 row
df['lag_st_2'] = df['st_2'].shift(1)  # Ditto for st_2
print(df)

# Create the result column: True where the previous row has the same id
# and st_1 or st_2 is greater than its lagged value, False everywhere else
df['result'] = ((df['id'] == df['lag_id'])
        & (
              (df['st_1'] > df['lag_st_1'])
            | (df['st_2'] > df['lag_st_2'])
          ))
print(df)

Updated output:

    id                       date  st_1  st_2
0    1  2022-02-28 00:00:00+00:00  60.0  6.00
1    2  2021-10-31 00:00:00+00:00  70.0  0.00
2    2  2021-12-31 00:00:00+00:00  70.0  4.00
3    3  2021-10-31 00:00:00+00:00  60.0  0.00
4    4  2021-06-30 00:00:00+00:00  63.3  2.66
5    4  2021-08-31 00:00:00+00:00  60.0  3.00
6    4  2022-02-28 00:00:00+00:00  70.0  2.00
8    4  2022-02-28 00:00:00+00:00  70.0  2.00
7    5  2021-06-30 00:00:00+00:00  70.0  3.00
9    5  2021-06-30 00:00:00+00:00  70.0  3.00
10   5  2021-08-31 00:00:00+00:00  80.0  2.00
11   5  2021-10-31 00:00:00+00:00  70.0  3.50
    id                       date  st_1  st_2  lag_id  lag_st_1  lag_st_2
0    1  2022-02-28 00:00:00+00:00  60.0  6.00     NaN       NaN       NaN
1    2  2021-10-31 00:00:00+00:00  70.0  0.00     1.0      60.0      6.00
2    2  2021-12-31 00:00:00+00:00  70.0  4.00     2.0      70.0      0.00
3    3  2021-10-31 00:00:00+00:00  60.0  0.00     2.0      70.0      4.00
4    4  2021-06-30 00:00:00+00:00  63.3  2.66     3.0      60.0      0.00
5    4  2021-08-31 00:00:00+00:00  60.0  3.00     4.0      63.3      2.66
6    4  2022-02-28 00:00:00+00:00  70.0  2.00     4.0      60.0      3.00
8    4  2022-02-28 00:00:00+00:00  70.0  2.00     4.0      70.0      2.00
7    5  2021-06-30 00:00:00+00:00  70.0  3.00     4.0      70.0      2.00
9    5  2021-06-30 00:00:00+00:00  70.0  3.00     5.0      70.0      3.00
10   5  2021-08-31 00:00:00+00:00  80.0  2.00     5.0      70.0      3.00
11   5  2021-10-31 00:00:00+00:00  70.0  3.50     5.0      80.0      2.00
    id                       date  st_1  st_2  lag_id  lag_st_1  lag_st_2  result
0    1  2022-02-28 00:00:00+00:00  60.0  6.00     NaN       NaN       NaN   False
1    2  2021-10-31 00:00:00+00:00  70.0  0.00     1.0      60.0      6.00   False
2    2  2021-12-31 00:00:00+00:00  70.0  4.00     2.0      70.0      0.00    True
3    3  2021-10-31 00:00:00+00:00  60.0  0.00     2.0      70.0      4.00   False
4    4  2021-06-30 00:00:00+00:00  63.3  2.66     3.0      60.0      0.00   False
5    4  2021-08-31 00:00:00+00:00  60.0  3.00     4.0      63.3      2.66    True
6    4  2022-02-28 00:00:00+00:00  70.0  2.00     4.0      60.0      3.00    True
8    4  2022-02-28 00:00:00+00:00  70.0  2.00     4.0      70.0      2.00   False
7    5  2021-06-30 00:00:00+00:00  70.0  3.00     4.0      70.0      2.00   False
9    5  2021-06-30 00:00:00+00:00  70.0  3.00     5.0      70.0      3.00   False
10   5  2021-08-31 00:00:00+00:00  80.0  2.00     5.0      70.0      3.00    True
11   5  2021-10-31 00:00:00+00:00  70.0  3.50     5.0      80.0      2.00    True
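The same per-id comparison can probably also be written without the helper lag columns, by shifting within each id group. The following is only a sketch under two assumptions that are not stated in the question: that the two repeated rows in the sample data are accidental (so they are dropped), and that the new column should be called `outcome` as in the expected result.

```python
import io
import pandas as pd

raw = """id,date,st_1,st_2
1,2022-02-28 00:00:00+00:00,60.0,6.0
2,2021-10-31 00:00:00+00:00,70.0,0.0
2,2021-12-31 00:00:00+00:00,70.0,4.0
3,2021-10-31 00:00:00+00:00,60.0,0.0
4,2021-06-30 00:00:00+00:00,63.3,2.66
4,2021-08-31 00:00:00+00:00,60.0,3.0
4,2022-02-28 00:00:00+00:00,70.0,2.0
5,2021-06-30 00:00:00+00:00,70.0,3.0
4,2022-02-28 00:00:00+00:00,70.0,2.0
5,2021-06-30 00:00:00+00:00,70.0,3.0
5,2021-08-31 00:00:00+00:00,80.0,2.0
5,2021-10-31 00:00:00+00:00,70.0,3.5
"""
df = pd.read_csv(io.StringIO(raw))
df["date"] = pd.to_datetime(df["date"], utc=True)  # real datetimes instead of strings

# Assumption: exact duplicate rows are unwanted, so drop them before comparing
df = df.drop_duplicates().sort_values(["id", "date"]).reset_index(drop=True)

# Previous st_1 / st_2 within the same id; the first row of each id gets NaN
prev = df.groupby("id")[["st_1", "st_2"]].shift(1)

# Comparisons against NaN are False, so the first row of each id is False
df["outcome"] = (df["st_1"] > prev["st_1"]) | (df["st_2"] > prev["st_2"])
print(df)
```

Because the shift happens inside each group, no separate id check is needed, and the column comes out with a plain boolean dtype.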

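IIUC, you want to check the condition using the dates, but here no date is tested against the previous date. The logic in C Pappy's answer is better than this one; this version only checks within each date group, so it yields fewer 'True' results. Please let us know which one is correct.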

import numpy as np

df.sort_values(['date', 'id'], inplace=True)
df['st_1_check'] = False
df['st_2_check'] = False

def test_conditions(x):
    if x.shape[0] > 1:
        x.loc[:, 'st_1_check'] = x['st_1'] - x['st_1'].shift(1)
        x.loc[:, 'st_2_check'] = x['st_2'] - x['st_2'].shift(1)
    return x

dfnew = df.groupby(['date']).apply(test_conditions)
dfnew.fillna(False, inplace=True)
dfnew['st_1_check'] = np.where(dfnew['st_1_check'] > 0, True, dfnew['st_1_check'])
dfnew['st_2_check'] = np.where(dfnew['st_2_check'] > 0, True, dfnew['st_2_check'])

dfnew['st_1_check'] = np.where(dfnew['st_1_check'] <= 0, False, dfnew['st_1_check'])
dfnew['st_2_check'] = np.where(dfnew['st_2_check'] <= 0, False, dfnew['st_2_check'])
dfnew

    id                       date     st_1    st_2 st_1_check st_2_check
4    4  2021-06-30 00:00:00+00:00 63.30000 2.66000      False      False
7    5  2021-06-30 00:00:00+00:00 70.00000 3.00000       True       True
9    5  2021-06-30 00:00:00+00:00 70.00000 3.00000      False      False
5    4  2021-08-31 00:00:00+00:00 60.00000 3.00000      False      False
10   5  2021-08-31 00:00:00+00:00 80.00000 2.00000       True      False
1    2  2021-10-31 00:00:00+00:00 70.00000 0.00000      False      False
3    3  2021-10-31 00:00:00+00:00 60.00000 0.00000      False      False
11   5  2021-10-31 00:00:00+00:00 70.00000 3.50000       True       True
2    2  2021-12-31 00:00:00+00:00 70.00000 4.00000      False      False
0    1  2022-02-28 00:00:00+00:00 60.00000 6.00000      False      False
6    4  2022-02-28 00:00:00+00:00 70.00000 2.00000       True      False
8    4  2022-02-28 00:00:00+00:00 70.00000 2.00000      False      False
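The per-date check above can likely be condensed with a grouped diff. This is a sketch on four of the rows only, assuming the intent is "the value is larger than the previous row with the same date"; it keeps the column names `st_1_check` and `st_2_check` from the code above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [4, 5, 4, 5],
    "date": ["2021-06-30 00:00:00+00:00", "2021-06-30 00:00:00+00:00",
             "2021-08-31 00:00:00+00:00", "2021-08-31 00:00:00+00:00"],
    "st_1": [63.3, 70.0, 60.0, 80.0],
    "st_2": [2.66, 3.0, 3.0, 2.0],
})
df = df.sort_values(["date", "id"]).reset_index(drop=True)

# Difference to the previous row within the same date; the first row of each date is NaN
checks = df.groupby("date")[["st_1", "st_2"]].diff() > 0  # NaN > 0 evaluates to False

df["st_1_check"] = checks["st_1"]
df["st_2_check"] = checks["st_2"]
print(df)
```

On these rows the flags agree with the corresponding lines of the output above.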

Try this:

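# Assumes df has the default integer index 0..n-1, so x-1 is the previous row
def func_date(x):
  if x==0:
    return 0  # the first row has no previous row to compare with
  elif df.at[x-1,'st_1']>df.at[x,'st_1']:
    return 'higher'  # previous st_1 is higher than the current one
  elif df.at[x-1,'st_1']==df.at[x,'st_1']:
    return '='
  else:
    return 'less'  # previous st_1 is less than the current one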

df['result']=df.index.map(func_date)
print(df)
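A vectorized variant of the same labelling might look like the sketch below. It uses a tiny made-up st_1 column, and it labels the first row 'first' instead of 0; that label is an assumption made here so the column holds only strings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"st_1": [60.0, 70.0, 70.0, 60.0]})  # illustrative values only

prev = df["st_1"].shift(1)  # previous row's st_1; NaN for the first row

# Same labels as func_date: they describe the previous value relative to the current one
df["result"] = np.select(
    [prev > df["st_1"], prev == df["st_1"], prev < df["st_1"]],
    ["higher", "=", "less"],
    default="first",  # assumption: no condition holds only for the first row (NaN)
)
print(df)
```

Unlike the index-based lookup, this does not depend on the index being 0..n-1, so it also works after sorting.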