熔化一个 pandas DataFrame
Melt a pandas DataFrame
我有一个 pandas DataFrame
这样的:
df = pd.DataFrame({'custid':[1,2,3,4],
...: 'prod1':['jeans','tshirt','jacket','tshirt'],
...: 'prod1_hnode1':[1,2,3,2],
...: 'prod1_hnode2':[6,7,8,7],
...: 'prod2':['tshirt','jeans','jacket','shirt'],
...: 'prod2_hnode1':[2,1,3,4],
...: 'prod2_hnode2':[7,6,8,7]})
In [54]: df
Out[54]:
custid prod1 prod1_hnode1 prod1_hnode2 prod2 prod2_hnode1 \
0 1 jeans 1 6 tshirt 2
1 2 tshirt 2 7 jeans 1
2 3 jacket 3 8 jacket 3
3 4 tshirt 2 7 shirt 4
prod2_hnode2
0 7
1 6
2 8
3 7
如何将其转换为以下格式:
dfnew = pd.DataFrame({'custid':[1,1,2,2,3,3,4,4],
...: 'prod':['prod1','prod2','prod1','prod2','prod1','prod2','prod1','prod2'],
...: 'rec':['jeans','tshirt','tshirt','jeans','jacket','jacket','tshirt','shirt'],
...: 'hnode1':[1,2,2,1,3,3,2,4],
...: 'hnode2':[6,7,7,6,8,8,7,7]})
In [56]: dfnew
Out[56]:
custid hnode1 hnode2 prod rec
0 1 1 6 prod1 jeans
1 1 2 7 prod2 tshirt
2 2 2 7 prod1 tshirt
3 2 1 6 prod2 jeans
4 3 3 8 prod1 jacket
5 3 3 8 prod2 jacket
6 4 2 7 prod1 tshirt
7 4 4 7 prod2 shirt
使用:
set_index
按列 custid
- 按
split
在列中创建 MultiIndex
- 将列中的
NaN
替换为 rec
stack
一级
reset_index
来自 MultiIndex
的列
- 重命名列
df = df.set_index('custid')
df.columns = df.columns.str.split('_', expand=True)
df = df.rename(columns={np.nan:'rec'})
cols = ['custid','hnode1','hnode2','prod','rec']
df = df.stack(0).reset_index().rename(columns={'level_1':'prod'}).reindex(columns=cols)
print (df)
custid hnode1 hnode2 prod rec
0 1 1 6 prod1 jeans
1 1 2 7 prod2 tshirt
2 2 2 7 prod1 tshirt
3 2 1 6 prod2 jeans
4 3 3 8 prod1 jacket
5 3 3 8 prod2 jacket
6 4 2 7 prod1 tshirt
7 4 4 7 prod2 shirt
这是另一种应该有效的方法,但使用了重复的 melt
s。
coln = df.dtypes.index # save some typing
df_long = pd.melt(
df, id_vars = "custid", value_vars = ["prod1", "prod2"],
var_name = "prod", value_name = "rec").assign(
hnode1 = pd.melt(df, id_vars = "custid",
value_vars = filter(lambda x: "hnode1" in x, coln))["value"],
hnode2 = pd.melt(df, id_vars = "custid",
value_vars = filter(lambda x: "hnode2" in x, coln))["value"])
print(df_long)
custid prod rec hnode1 hnode2
0 1 prod1 jeans 1 6
1 2 prod1 tshirt 2 7
2 3 prod1 jacket 3 8
3 4 prod1 tshirt 2 7
4 1 prod2 tshirt 2 7
5 2 prod2 jeans 1 6
6 3 prod2 jacket 3 8
7 4 prod2 shirt 4 7
您在评论中提到了 R。 "data.table" 中的 melt
应该能够更轻松地处理这个问题,因为您可以一次融化多组列,类似于您可能如何处理基本 R 的 reshape
函数的问题。
基础 R 方法可能类似于:
reshape(df, direction = "long", idvar = "custid",
varying = list(c(2, 5), c(3, 6), c(4, 7)),
sep = "", times = c("prod1", "prod2"))
我有一个 pandas DataFrame
这样的:
df = pd.DataFrame({'custid':[1,2,3,4],
...: 'prod1':['jeans','tshirt','jacket','tshirt'],
...: 'prod1_hnode1':[1,2,3,2],
...: 'prod1_hnode2':[6,7,8,7],
...: 'prod2':['tshirt','jeans','jacket','shirt'],
...: 'prod2_hnode1':[2,1,3,4],
...: 'prod2_hnode2':[7,6,8,7]})
In [54]: df
Out[54]:
custid prod1 prod1_hnode1 prod1_hnode2 prod2 prod2_hnode1 \
0 1 jeans 1 6 tshirt 2
1 2 tshirt 2 7 jeans 1
2 3 jacket 3 8 jacket 3
3 4 tshirt 2 7 shirt 4
prod2_hnode2
0 7
1 6
2 8
3 7
如何将其转换为以下格式:
dfnew = pd.DataFrame({'custid':[1,1,2,2,3,3,4,4],
...: 'prod':['prod1','prod2','prod1','prod2','prod1','prod2','prod1','prod2'],
...: 'rec':['jeans','tshirt','tshirt','jeans','jacket','jacket','tshirt','shirt'],
...: 'hnode1':[1,2,2,1,3,3,2,4],
...: 'hnode2':[6,7,7,6,8,8,7,7]})
In [56]: dfnew
Out[56]:
custid hnode1 hnode2 prod rec
0 1 1 6 prod1 jeans
1 1 2 7 prod2 tshirt
2 2 2 7 prod1 tshirt
3 2 1 6 prod2 jeans
4 3 3 8 prod1 jacket
5 3 3 8 prod2 jacket
6 4 2 7 prod1 tshirt
7 4 4 7 prod2 shirt
使用:
set_index
按列custid
- 按
split
在列中创建 - 将列中的
NaN
替换为rec
stack
一级reset_index
来自MultiIndex
的列
- 重命名列
MultiIndex
df = df.set_index('custid')
df.columns = df.columns.str.split('_', expand=True)
df = df.rename(columns={np.nan:'rec'})
cols = ['custid','hnode1','hnode2','prod','rec']
df = df.stack(0).reset_index().rename(columns={'level_1':'prod'}).reindex(columns=cols)
print (df)
custid hnode1 hnode2 prod rec
0 1 1 6 prod1 jeans
1 1 2 7 prod2 tshirt
2 2 2 7 prod1 tshirt
3 2 1 6 prod2 jeans
4 3 3 8 prod1 jacket
5 3 3 8 prod2 jacket
6 4 2 7 prod1 tshirt
7 4 4 7 prod2 shirt
这是另一种应该有效的方法,但使用了重复的 melt
s。
coln = df.dtypes.index # save some typing
df_long = pd.melt(
df, id_vars = "custid", value_vars = ["prod1", "prod2"],
var_name = "prod", value_name = "rec").assign(
hnode1 = pd.melt(df, id_vars = "custid",
value_vars = filter(lambda x: "hnode1" in x, coln))["value"],
hnode2 = pd.melt(df, id_vars = "custid",
value_vars = filter(lambda x: "hnode2" in x, coln))["value"])
print(df_long)
custid prod rec hnode1 hnode2
0 1 prod1 jeans 1 6
1 2 prod1 tshirt 2 7
2 3 prod1 jacket 3 8
3 4 prod1 tshirt 2 7
4 1 prod2 tshirt 2 7
5 2 prod2 jeans 1 6
6 3 prod2 jacket 3 8
7 4 prod2 shirt 4 7
您在评论中提到了 R。 "data.table" 中的 melt
应该能够更轻松地处理这个问题,因为您可以一次融化多组列,类似于您可能如何处理基本 R 的 reshape
函数的问题。
基础 R 方法可能类似于:
reshape(df, direction = "long", idvar = "custid",
varying = list(c(2, 5), c(3, 6), c(4, 7)),
sep = "", times = c("prod1", "prod2"))