How to put values into new, separate DataFrame columns based on the value of a specific existing column (i.e. transpose the DataFrame)?
I have a DataFrame with mixed data from different sources. Note that some of the data was obtained at the same timestamp:
+--------------------------------------+------+-------------------+-----------------+---------------+-----------------------+
|devicename                            |value |time               |one_type_id      |another_type_id|write_time             |
+--------------------------------------+------+-------------------+-----------------+---------------+-----------------------+
|Real_Power_KPI                        |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.129|
|Voltage_Sensor                        |243.93|2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.129|
|Current_Sensor                        |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.129|
|Casing_Vibration_Sensor               |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Water_Temperature_Sensor              |17.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Environment_Ambient_Temperature_Sensor|17.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Pump_Vibration_Sensor                 |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Water_Level_Sensor                    |15.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Environment_Humidity_Sensor           |81.2  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Water_Temperature_Sensor              |17.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Casing_Vibration_Sensor               |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Pump_Vibration_Sensor                 |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Environment_Ambient_Temperature_Sensor|17.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Water_Level_Sensor                    |15.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Environment_Humidity_Sensor           |81.2  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Real_Power_KPI                        |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Voltage_Sensor                        |245.01|2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Current_Sensor                        |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Real_Power_KPI                        |0.0   |2021-03-24 07:06:36|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Voltage_Sensor                        |244.31|2021-03-24 07:06:36|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Current_Sensor                        |0.0   |2021-03-24 07:06:36|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
+--------------------------------------+------+-------------------+-----------------+---------------+-----------------------+
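So what I want is separate columns for Real_Power_KPI, Voltage_Sensor, and Current_Sensor, with their corresponding values combined into a single row for each shared timestamp.
Something like: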
|timestamp          |Real_Power_KPI|Voltage_Sensor|Current_Sensor|
|2021-03-24 07:06:36|0.0           |244.31        |0.0           |
So what is the best way to perform this transpose operation?
UPD.
The answer by 过过招 proposed Python code; here is its Scala equivalent:
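import org.apache.spark.sql.functions.expr

// Keep the three devices, sum duplicates per (time, devicename), then pivot devicename into columns
val df = dailySensorData
  .filter("devicename in ('Real_Power_KPI', 'Voltage_Sensor', 'Current_Sensor')")
  .groupBy("time", "devicename").agg(expr("sum(value) as total"))
  .groupBy("time").pivot("devicename").agg(expr("first(total)"))
df.show(false)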
Group and aggregate first, then use pivot to turn the rows into columns:
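from pyspark.sql import functions as F

# Keep the three devices, sum duplicates per (time, devicename), then pivot devicename into columns
df = df.filter("devicename in ('Real_Power_KPI', 'Voltage_Sensor', 'Current_Sensor')") \
    .groupBy('time', 'devicename').agg(F.expr('sum(value) as total')) \
    .groupBy('time').pivot('devicename').agg(F.expr('first(total)'))
df.show(truncate=False)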
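To make it easier to see what the two grouping steps compute, here is a rough plain-Python sketch (no Spark involved) that emulates the same filter, sum, and pivot logic on a handful of rows copied from the sample table. The names `rows`, `WANTED`, and `pivot_rows` are hypothetical and exist only for this illustration; it is not how Spark implements `pivot` internally.

```python
from collections import defaultdict

# A few (devicename, value, time) rows copied from the sample table.
rows = [
    ("Real_Power_KPI", 0.0, "2021-03-24 07:06:35"),
    ("Voltage_Sensor", 243.93, "2021-03-24 07:06:35"),
    ("Current_Sensor", 0.0, "2021-03-24 07:06:35"),
    ("Water_Level_Sensor", 15.0, "2021-03-24 07:06:35"),
    ("Real_Power_KPI", 0.0, "2021-03-24 07:06:35"),
    ("Voltage_Sensor", 245.01, "2021-03-24 07:06:35"),
    ("Current_Sensor", 0.0, "2021-03-24 07:06:35"),
    ("Real_Power_KPI", 0.0, "2021-03-24 07:06:36"),
    ("Voltage_Sensor", 244.31, "2021-03-24 07:06:36"),
    ("Current_Sensor", 0.0, "2021-03-24 07:06:36"),
]
# Hypothetical constant: the device names that become columns.
WANTED = ("Real_Power_KPI", "Voltage_Sensor", "Current_Sensor")


def pivot_rows(rows, wanted=WANTED):
    """Emulate filter -> groupBy(time, devicename).sum -> groupBy(time).pivot."""
    # Step 1: drop unwanted devices, then sum values per (time, devicename).
    totals = defaultdict(float)
    for devicename, value, time in rows:
        if devicename in wanted:
            totals[(time, devicename)] += value
    # Step 2: build one output row per time, with one column per device name.
    pivoted = defaultdict(dict)
    for (time, devicename), total in totals.items():
        pivoted[time][devicename] = total
    return dict(pivoted)


result = pivot_rows(rows)
for time in sorted(result):
    print(time, [result[time].get(name) for name in WANTED])
```

Note that because the first step sums, the two Voltage_Sensor readings at 07:06:35 (243.93 and 245.01) end up added together in that row. If that is not what you want, a different aggregate such as `avg` or `max` in place of `sum` in the Spark code may be a better fit, depending on how duplicate readings should be treated.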