Add a column to a DataFrame after selecting rows based on column values
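I have a weather forecast dataset and am interested in the following columns:
period (values: 1, 2, 3)
temp2m: the temperature 2 meters from the weather station.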
p1 = new_df.where(new_df.period == 1).select([c for c in df.columns if c in ['period','temp2m']]).show()
This code for p1 gives the following (first 5 rows):
+------+------+
|period|temp2m|
+------+------+
|     0|    12|
|     0|    13|
|     0|    11|
|     0|    13|
|     0|    10|
+------+------+
I want to store the temp2m results in the main DataFrame new_df as temp2m_p1. Similarly, I also want to add temp2m_p2 and temp2m_p3. However, I could not find a solution to this problem at https://sparkbyexamples.com/pyspark/pyspark-add-new-column-to-dataframe/.
Please always provide a toy example and the expected result. Here is mine:
import pandas as pd
new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})
   period  temp2m
0       1      12
1       1      13
2       1      12
3       2      20
4       2      21
5       2      22
6       3      18
7       3      18
8       3      16
I believe you want:
for p in new_df['period'].unique():
    new_df[f'temp2m_p{p}'] = new_df['temp2m'].where(new_df['period'] == p)
This results in:
   period  temp2m  temp2m_p1  temp2m_p2  temp2m_p3
0       1      12       12.0        NaN        NaN
1       1      13       13.0        NaN        NaN
2       1      12       12.0        NaN        NaN
3       2      20        NaN       20.0        NaN
4       2      21        NaN       21.0        NaN
5       2      22        NaN       22.0        NaN
6       3      18        NaN        NaN       18.0
7       3      18        NaN        NaN       18.0
8       3      16        NaN        NaN       16.0
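A side note on the float values above: Series.where fills the non-matching rows with NaN, which forces the new columns to float64. If you would rather keep integers, one option (my assumption about what you might want, not something the question asks for) is to cast to pandas' nullable Int64 dtype first. A minimal sketch on the same toy data:

```python
import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

# Nullable integer dtype: missing entries become <NA> instead of NaN,
# so the kept values stay integers rather than being upcast to float.
temps = new_df['temp2m'].astype('Int64')

for p in new_df['period'].unique():
    # Keep the temperature where the row belongs to period p, <NA> elsewhere.
    new_df[f'temp2m_p{p}'] = temps.where(new_df['period'] == p)

print(new_df)
```

The layout is the same as before, only the dtype of the new columns differs.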
Edit, after the comments:
df_transformed = pd.concat((new_df[new_df['period'] == p]['temp2m'].rename(f'temp2m_{p}').reset_index(drop=True) for p in new_df['period'].unique()), axis=1)
This gives:
   temp2m_1  temp2m_2  temp2m_3
0        12        20        18
1        13        21        18
2        12        22        16
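If the one-liner is hard to read, the same wide layout can probably also be reached with groupby().cumcount() plus pivot. This is only a sketch of an alternative; the helper column name pos and the variable name wide are made up for illustration:

```python
import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

# Position of each row within its period (0, 1, 2, ...); this becomes
# the row index of the wide table. 'pos' is a hypothetical helper name.
pos = new_df.groupby('period').cumcount()

wide = (new_df.assign(pos=pos)
              .pivot(index='pos', columns='period', values='temp2m')
              .add_prefix('temp2m_'))  # column labels 1, 2, 3 -> temp2m_1, ...

print(wide)
```

With periods of unequal length, the shorter ones should simply be padded with NaN, as in the concat version.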