Stack, unstack, melt, pivot, transpose? What is the simple method to convert multiple columns into rows (PySpark or Pandas)?
My work environment mostly uses PySpark, but from some googling, transposing in PySpark looks very complicated. I'd like to keep it in PySpark, but if it's much easier to do in Pandas I'll convert the Spark dataframe to a Pandas dataframe. The dataset isn't very large, so I don't think performance is an issue.
I want to convert a dataframe with multiple columns into rows:
Input:
import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
'Hospital Address': {0: '1234 Street 429',
1: '553 Alberta Road 441',
2: '994 Random Street 923'},
'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
Record Hospital Hospital Address Medicine_1 Medicine_2 Medicine_3 Medicine_4
1 Red Cross 1234 Street 429 Effective Effective Normal Effective
2 Alberta Hospital 553 Alberta Road 441 Effecive Normal Normal Effective
3 General Hospital 994 Random Street 923 Normal Effective Normal Effective
Output:
Record Hospital Hospital Address Name Value
0 1 Red Cross 1234 Street 429 Medicine_1 Effective
1 2 Red Cross 1234 Street 429 Medicine_2 Effective
2 3 Red Cross 1234 Street 429 Medicine_3 Normal
3 4 Red Cross 1234 Street 429 Medicine_4 Effective
4 5 Alberta Hospital 553 Alberta Road 441 Medicine_1 Effecive
5 6 Alberta Hospital 553 Alberta Road 441 Medicine_2 Normal
6 7 Alberta Hospital 553 Alberta Road 441 Medicine_3 Normal
7 8 Alberta Hospital 553 Alberta Road 441 Medicine_4 Effective
8 9 General Hospital 994 Random Street 923 Medicine_1 Normal
9 10 General Hospital 994 Random Street 923 Medicine_2 Effective
10 11 General Hospital 994 Random Street 923 Medicine_3 Normal
11 12 General Hospital 994 Random Street 923 Medicine_4 Effective
Looking at PySpark examples, it seems complicated.
Looking at Pandas examples, it seems much easier. But there are many different Stack Overflow answers, some saying use pivot, some melt, some stack, some unstack, and even more beyond that, which ends up being confusing.
So if anyone has a simple way to do this in PySpark, I'm all ears. If not, I'll happily take a Pandas answer.
Thanks a lot for your help!
Here is a Pandas approach using stack:
df_final = (df.set_index(['Record', 'Hospital', 'Hospital Address'])
.stack(dropna=False)
.rename('Value')
.reset_index()
.rename({'level_3': 'Name'},axis=1)
.assign(Record=lambda x: x.index+1))
Out[120]:
Record Hospital Hospital Address Name Value
0 1 Red Cross 1234 Street 429 Medicine_1 Effective
1 2 Red Cross 1234 Street 429 Medicine_2 Effective
2 3 Red Cross 1234 Street 429 Medicine_3 Normal
3 4 Red Cross 1234 Street 429 Medicine_4 Effective
4 5 Alberta Hospital 553 Alberta Road 441 Medicine_1 Effecive
5 6 Alberta Hospital 553 Alberta Road 441 Medicine_2 Normal
6 7 Alberta Hospital 553 Alberta Road 441 Medicine_3 Normal
7 8 Alberta Hospital 553 Alberta Road 441 Medicine_4 Effective
8 9 General Hospital 994 Random Street 923 Medicine_1 Normal
9 10 General Hospital 994 Random Street 923 Medicine_2 Effective
10 11 General Hospital 994 Random Street 923 Medicine_3 Normal
11 12 General Hospital 994 Random Street 923 Medicine_4 Effective
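A small variant of the same stack approach, assuming the same df as above: naming the column axis with rename_axis before stacking makes reset_index() produce the 'Name' column directly, so the level_3 rename is no longer needed.

```python
import pandas as pd

df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
                   'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
                   'Hospital Address': {0: '1234 Street 429',
                                        1: '553 Alberta Road 441',
                                        2: '994 Random Street 923'},
                   'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
                   'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
                   'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
                   'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

# Name the column axis up front so reset_index() yields a
# 'Name' column directly, with no 'level_3' rename afterwards.
df_final = (df.set_index(['Record', 'Hospital', 'Hospital Address'])
              .rename_axis('Name', axis=1)
              .stack()
              .rename('Value')
              .reset_index()
              .assign(Record=lambda x: x.index + 1))
```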
You can also use .melt and specify id_vars; everything else is treated as value_vars. The number of value_vars columns multiplies the row count of the dataframe by that factor, stacking the information from all four columns into a single column and duplicating the id_vars columns into the format you want:
DataFrame setup:
import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
'Hospital Address': {0: '1234 Street 429',
1: '553 Alberta Road 441',
2: '994 Random Street 923'},
'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
Code:
df = (df.melt(id_vars=['Record','Hospital', 'Hospital Address'],
var_name='Name',
value_name='Value')
.sort_values('Record')
.reset_index(drop=True))
df['Record'] = df.index+1
df
Out[1]:
Record Hospital Hospital Address Name Value
0 1 Red Cross 1234 Street 429 Medicine_1 Effective
1 2 Red Cross 1234 Street 429 Medicine_2 Effective
2 3 Red Cross 1234 Street 429 Medicine_3 Normal
3 4 Red Cross 1234 Street 429 Medicine_4 Effective
4 5 Alberta Hospital 553 Alberta Road 441 Medicine_1 Effecive
5 6 Alberta Hospital 553 Alberta Road 441 Medicine_2 Normal
6 7 Alberta Hospital 553 Alberta Road 441 Medicine_3 Normal
7 8 Alberta Hospital 553 Alberta Road 441 Medicine_4 Effective
8 9 General Hospital 994 Random Street 923 Medicine_1 Normal
9 10 General Hospital 994 Random Street 923 Medicine_2 Effective
10 11 General Hospital 994 Random Street 923 Medicine_3 Normal
11 12 General Hospital 994 Random Street 923 Medicine_4 Effective
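The melt call above relies on everything not listed in id_vars becoming a value_var. If the frame ever gains extra columns you don't want unpivoted, you can select value_vars explicitly instead; a minimal sketch, assuming the same df as in the setup above:

```python
import pandas as pd

df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
                   'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
                   'Hospital Address': {0: '1234 Street 429',
                                        1: '553 Alberta Road 441',
                                        2: '994 Random Street 923'},
                   'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
                   'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
                   'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
                   'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

# Only the Medicine_* columns are unpivoted; any column listed
# in neither id_vars nor value_vars is simply dropped.
value_vars = [c for c in df.columns if c.startswith('Medicine_')]
long_df = df.melt(id_vars=['Record', 'Hospital', 'Hospital Address'],
                  value_vars=value_vars,
                  var_name='Name',
                  value_name='Value')
```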
With PySpark you can also use stack. Simple/easy:
# create sample data
import pandas as pd
from pyspark.sql.functions import expr
panda_df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
'Hospital Address': {0: '1234 Street 429',
1: '553 Alberta Road 441',
2: '994 Random Street 923'},
'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
df = spark.createDataFrame(panda_df)
# calculate
df.select("Hospital","Hospital Address",
expr("stack(4, 'Medicine_1', Medicine_1, 'Medicine_2', Medicine_2, \
'Medicine_3', Medicine_3,'Medicine_4',Medicine_4) as (MedicinName, Effectiveness)")
).where("Effectiveness is not null").show()
Generating the query dynamically when there are many columns:
The main idea here is to create the stack(n, a, b, c, ...) expression dynamically. We can use Python string formatting to build the dynamic string.
index_cols= ["Hospital","Hospital Address"]
drop_cols = ['Record']
# Select all columns which need to be pivoted down
pivot_cols = [c for c in df.columns if c not in index_cols + drop_cols]
# Create a dynamic stack expression; here we generate stack(4, '{0}', {0}, '{1}', {1}, ...)
# "'{0}',{0},'{1}',{1}".format('Medicine_1', 'Medicine_2') = "'Medicine_1',Medicine_1,'Medicine_2',Medicine_2"
# which is similar to what we wrote previously
stackexpr = "stack(" + str(len(pivot_cols)) + "," + ",".join(["'{" + str(i) + "}',{" + str(i) + "}" for i in range(len(pivot_cols))]) + ")"
df.selectExpr(*index_cols,stackexpr.format(*pivot_cols) ).show()
Output:
+----------------+--------------------+-----------+-------------+
| Hospital| Hospital Address|MedicinName|Effectiveness|
+----------------+--------------------+-----------+-------------+
| Red Cross| 1234 Street 429| Medicine_1| Effective|
| Red Cross| 1234 Street 429| Medicine_2| Effective|
| Red Cross| 1234 Street 429| Medicine_3| Normal|
| Red Cross| 1234 Street 429| Medicine_4| Effective|
|Alberta Hospital|553 Alberta Road 441| Medicine_1| Effecive|
|Alberta Hospital|553 Alberta Road 441| Medicine_2| Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_3| Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_4| Effective|
|General Hospital|994 Random Street...| Medicine_1| Normal|
|General Hospital|994 Random Street...| Medicine_2| Effective|
|General Hospital|994 Random Street...| Medicine_3| Normal|
|General Hospital|994 Random Street...| Medicine_4| Effective|
+----------------+--------------------+-----------+-------------+
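The expression-building step can also be written with an f-string. Wrapping each column reference in backticks lets the same helper cope with column names that contain spaces. A sketch; build_stack_expr is a hypothetical helper, not part of the answer above:

```python
def build_stack_expr(cols, as_clause="(MedicinName, Effectiveness)"):
    """Build a Spark SQL stack() expression for the given columns.

    Backticks around each column reference keep names with spaces
    or other special characters valid in Spark SQL.
    """
    pairs = ", ".join(f"'{c}', `{c}`" for c in cols)
    return f"stack({len(cols)}, {pairs}) as {as_clause}"

pivot_cols = ['Medicine_1', 'Medicine_2', 'Medicine_3', 'Medicine_4']
stackexpr = build_stack_expr(pivot_cols)
# Then, on a Spark DataFrame:
# df.selectExpr("Hospital", "`Hospital Address`", stackexpr).show()
```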