Categorizing ranges of data in a dataframe using Pandas
I have a time series dataframe containing data from multiple sites, like this:
Site Date Variable
1 01/01/2021 -1
1 02/01/2021 0
1 03/01/2021 1
1 04/01/2021 0
1 05/01/2021 -1
1 06/01/2021 0
1 07/01/2021 1
1 08/01/2021 2
1 09/01/2021 1
1 10/01/2021 0
2 01/01/2021 -5
2 02/01/2021 3
2 03/01/2021 2
2 04/01/2021 6
2 05/01/2021 -3
2 06/01/2021 3
2 07/01/2021 1
2 08/01/2021 -4
2 09/01/2021 -5
2 10/01/2021 -1
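When plotted, some sites show a large range of values while others show a smaller one.
I would like to find a way to classify the data into groups with 'high' and 'low' ranges, e.g. site 1 would fall into the group ranging from -2 to 2. I assume I will have to set these ranges manually, which is fine.
I have tried bins and dynamic bins, but as far as I can tell these only categorize individual values, whereas I need to look at each [Site] as a whole and categorize it based on all of the data within that site. In the end I need something like this: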
Site Date Variable Type
1 01/01/2021 -1 LOW
1 02/01/2021 0 LOW
1 03/01/2021 1 LOW
1 04/01/2021 0 LOW
1 05/01/2021 -1 LOW
1 06/01/2021 0 LOW
1 07/01/2021 1 LOW
1 08/01/2021 2 LOW
1 09/01/2021 1 LOW
1 10/01/2021 0 LOW
2 01/01/2021 -5 HIGH
2 02/01/2021 3 HIGH
2 03/01/2021 2 HIGH
2 04/01/2021 6 HIGH
2 05/01/2021 -3 HIGH
2 06/01/2021 3 HIGH
2 07/01/2021 1 HIGH
2 08/01/2021 -4 HIGH
2 09/01/2021 -5 HIGH
2 10/01/2021 -1 HIGH
You can compute the range (= max - min) of each group and define HIGH/LOW based on a threshold (I used 3 here):
df['Type'] = (df.groupby('Site')
                ['Variable']
                .transform(lambda g: 'HIGH' if g.max()-g.min() > 3 else 'LOW')
             )
Output:
Site Date Variable Type
0 1 01/01/2021 -1 LOW
1 1 02/01/2021 0 LOW
2 1 03/01/2021 1 LOW
3 1 04/01/2021 0 LOW
4 1 05/01/2021 -1 LOW
5 1 06/01/2021 0 LOW
6 1 07/01/2021 1 LOW
7 1 08/01/2021 2 LOW
8 1 09/01/2021 1 LOW
9 1 10/01/2021 0 LOW
10 2 01/01/2021 -5 HIGH
11 2 02/01/2021 3 HIGH
12 2 03/01/2021 2 HIGH
13 2 04/01/2021 6 HIGH
14 2 05/01/2021 -3 HIGH
15 2 06/01/2021 3 HIGH
16 2 07/01/2021 1 HIGH
17 2 08/01/2021 -4 HIGH
18 2 09/01/2021 -5 HIGH
19 2 10/01/2021 -1 HIGH
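As a self-contained check, the sketch below rebuilds the sample data from the question and applies the same rule without a Python lambda, using the built-in 'max' and 'min' group transforms together with numpy.where. The names THRESHOLD, grouped and rng are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
dates = [f"{d:02d}/01/2021" for d in range(1, 11)]
df = pd.DataFrame({
    "Site": [1] * 10 + [2] * 10,
    "Date": dates * 2,
    "Variable": [-1, 0, 1, 0, -1, 0, 1, 2, 1, 0,
                 -5, 3, 2, 6, -3, 3, 1, -4, -5, -1],
})

THRESHOLD = 3  # manually chosen cut-off, same value as in the answer

# Per-site range (max - min), broadcast back to every row of the site
grouped = df.groupby("Site")["Variable"]
rng = grouped.transform("max") - grouped.transform("min")

# Rows of a site whose range exceeds the threshold are HIGH, the rest LOW
df["Type"] = np.where(rng > THRESHOLD, "HIGH", "LOW")
```

With the sample data, site 1 has a range of 3 (not above the threshold) and site 2 has a range of 11, so this should give the same labels as the lambda version.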
For an arbitrary number of categories, use pandas.cut:
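import pandas as pd

# range (max - min) of Variable within each site, broadcast to every row
df['range'] = (df.groupby('Site')['Variable']
                 .transform(lambda g: g.max()-g.min())
              )
# group_name: lower bound of the bin (each bin runs up to the next bound, inclusive)
groups = {'LOW': 0, 'MEDIUM': 3, 'HIGH': 12}
df['Type'] = pd.cut(df['range'],
                    bins=list(groups.values())+[float('inf')],
                    labels=list(groups),
                    include_lowest=True,  # so a range of exactly 0 is LOW, not NaN
                   )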
Output:
Site Date Variable Type range
0 1 01/01/2021 -1 LOW 3
1 1 02/01/2021 0 LOW 3
2 1 03/01/2021 1 LOW 3
3 1 04/01/2021 0 LOW 3
4 1 05/01/2021 -1 LOW 3
5 1 06/01/2021 0 LOW 3
6 1 07/01/2021 1 LOW 3
7 1 08/01/2021 2 LOW 3
8 1 09/01/2021 1 LOW 3
9 1 10/01/2021 0 LOW 3
10 2 01/01/2021 -5 MEDIUM 11
11 2 02/01/2021 3 MEDIUM 11
12 2 03/01/2021 2 MEDIUM 11
13 2 04/01/2021 6 MEDIUM 11
14 2 05/01/2021 -3 MEDIUM 11
15 2 06/01/2021 3 MEDIUM 11
16 2 07/01/2021 1 MEDIUM 11
17 2 08/01/2021 -4 MEDIUM 11
18 2 09/01/2021 -5 MEDIUM 11
19 2 10/01/2021 -1 MEDIUM 11
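If a one-row-per-site summary is also useful, a possible variant is to compute the range once per site, bin that small Series, and then map the labels back onto the rows. This is only a sketch under the same bin edges as above; the names site_range, site_type, edges and labels are assumptions introduced for illustration.

```python
import pandas as pd

# Sample data copied from the question
dates = [f"{d:02d}/01/2021" for d in range(1, 11)]
df = pd.DataFrame({
    "Site": [1] * 10 + [2] * 10,
    "Date": dates * 2,
    "Variable": [-1, 0, 1, 0, -1, 0, 1, 2, 1, 0,
                 -5, 3, 2, 6, -3, 3, 1, -4, -5, -1],
})

# One value per site: max - min of Variable
grouped = df.groupby("Site")["Variable"]
site_range = grouped.max() - grouped.min()

# Manually chosen bin edges: [0, 3] -> LOW, (3, 12] -> MEDIUM, (12, inf] -> HIGH
edges = [0, 3, 12, float("inf")]
labels = ["LOW", "MEDIUM", "HIGH"]
site_type = pd.cut(site_range, bins=edges, labels=labels, include_lowest=True)

# Look up each row's site label via the Site index of site_type
df["Type"] = df["Site"].map(site_type)
```

Here site_type holds one label per site, which can be inspected or adjusted by hand before it is mapped onto the full dataframe.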