Binary vectorization encoding for a categorical variable, grouped by date
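I'm having trouble vectorizing this with some kind of binary encoding while aggregating whenever there is more than one row per group (the values of the categorical variable are not mutually exclusive), without merging rows that belong to different dates. (Python and pandas)
Suppose this is the data: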
id1 | id2 | type | month.measure |
---|---|---|---|
105 | 50 | growing | 04-2020 |
105 | 50 | advancing | 04-2020 |
44 | 29 | advancing | 04-2020 |
105 | 50 | retreating | 05-2020 |
105 | 50 | shrinking | 05-2020 |
and I want to end up with this:
id1 | id2 | growing | shrinking | advancing | retreating | month.measure |
---|---|---|---|---|---|---|
105 | 50 | 1 | 0 | 1 | 0 | 04-2020 |
44 | 29 | 0 | 0 | 1 | 0 | 04-2020 |
105 | 50 | 0 | 1 | 0 | 1 | 05-2020 |
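I've been trying various transformations, lambda functions, and pandas get_dummies, and I've tried grouping by the two IDs and the date, but I couldn't find a way.
Hopefully we can sort it out! Thanks in advance! :)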
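This solution uses pandas get_dummies to one-hot encode the "TYPE" column, concatenates the one-hot encoded dataframe with the original dataframe, and then applies a groupby on the ID columns and "MONTH.MEASURE":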
import pandas as pd

# Set up the dataframe
ID1 = [105, 105, 44, 105, 105]
ID2 = [50, 50, 29, 50, 50]
TYPE = ['growing', 'advancing', 'advancing', 'retreating', 'shrinking']
MONTH = ['04-2020', '04-2020', '04-2020', '05-2020', '05-2020']
df = pd.DataFrame({'ID1': ID1, 'ID2': ID2, 'TYPE': TYPE, 'MONTH.MEASURE': MONTH})

# Apply get_dummies and groupby operations
df = (
    pd.concat([df.drop('TYPE', axis=1), pd.get_dummies(df['TYPE'])], axis=1)
    .groupby(['ID1', 'ID2', 'MONTH.MEASURE']).sum().reset_index()
)

# These bits are just cosmetic to get the output to look more like your required output
df.columns = [c.upper() for c in df.columns]
col_order = ['GROWING', 'SHRINKING', 'ADVANCING', 'RETREATING', 'MONTH.MEASURE']
df[['ID1', 'ID2'] + col_order]
#    ID1  ID2  GROWING  SHRINKING  ADVANCING  RETREATING MONTH.MEASURE
# 0   44   29        0          0          1           0       04-2020
# 1  105   50        1          0          1           0       04-2020
# 2  105   50        0          1          0           1       05-2020
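One caveat with aggregating by sum(): if the same combination of IDs, month, and type occurs in more than one row, the result is a count greater than 1 rather than a binary flag. The sketch below assumes strictly 0/1 output is wanted and swaps sum() for max(). The sixth row in the sample data is a hypothetical duplicate added only to make the difference visible, and the names `keys`, `dummies`, `counts`, and `binary` are illustrative.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical duplicate row
# (105, 50, 'growing', '04-2020') added only to show the difference.
df = pd.DataFrame({
    'id1': [105, 105, 44, 105, 105, 105],
    'id2': [50, 50, 29, 50, 50, 50],
    'type': ['growing', 'advancing', 'advancing',
             'retreating', 'shrinking', 'growing'],
    'month.measure': ['04-2020', '04-2020', '04-2020',
                      '05-2020', '05-2020', '04-2020'],
})

keys = ['id1', 'id2', 'month.measure']

# One-hot encode 'type' as integers and keep the grouping keys alongside.
dummies = pd.concat([df[keys], pd.get_dummies(df['type'], dtype=int)], axis=1)

# sum() counts occurrences per group; max() keeps each flag at 0 or 1.
counts = dummies.groupby(keys).sum().reset_index()
binary = dummies.groupby(keys).max().reset_index()

print(counts)
print(binary)
```

If the data is known to contain no repeated (IDs, month, type) combinations, sum() and max() give the same result, so the choice only matters when duplicates are possible.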
This is a job for crosstab:
pd.crosstab([df['id1'],df['id2'],df['month.measure']], df['type']).reset_index()
Output:
type  id1  id2 month.measure  advancing  growing  retreating  shrinking
0      44   29       04-2020          1        0           0          0
1     105   50       04-2020          1        1           0          0
2     105   50       05-2020          0        0           1          1
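The crosstab result differs from the layout requested in the question in two cosmetic ways: the type columns come out in alphabetical order, and the columns axis carries the label `type`. crosstab also returns counts, so repeated rows would give values above 1. A minimal sketch that addresses all three, assuming the lowercase column names from the question; `wanted`, `table`, and `result` are illustrative names.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'id1': [105, 105, 44, 105, 105],
    'id2': [50, 50, 29, 50, 50],
    'type': ['growing', 'advancing', 'advancing', 'retreating', 'shrinking'],
    'month.measure': ['04-2020', '04-2020', '04-2020', '05-2020', '05-2020'],
})

# Column order requested in the question.
wanted = ['growing', 'shrinking', 'advancing', 'retreating']

table = (
    pd.crosstab([df['id1'], df['id2'], df['month.measure']], df['type'])
    .clip(upper=1)                          # turn counts into 0/1 flags
    .reindex(columns=wanted, fill_value=0)  # reorder; absent types become 0
    .reset_index()
)
table.columns.name = None  # drop the leftover 'type' axis label

result = table[['id1', 'id2'] + wanted + ['month.measure']]
print(result)
```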