Add a column to a DataFrame after selecting rows based on column values

Question

I have a weather forecast dataset, and I am interested in the following columns:

period (values: 1, 2, 3)
temp2m: the temperature 2 meters from the corresponding weather station.

p1 = new_df.where(new_df.period == 1).select([c for c in new_df.columns if c in ['period', 'temp2m']])
p1.show()  # show() only prints and returns None, so call it separately to keep p1 a DataFrame

This code for p1 gives the following (first 5 rows):

+------+------+
|period|temp2m|
+------+------+
|     0|    12|
|     0|    13|
|     0|    11|
|     0|    13|
|     0|    10|
+------+------+

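I would like to store the temp2m results in the main DataFrame new_df as temp2m_p1. Similarly, I would also like to add temp2m_p2 and temp2m_p3. However, I could not find a solution to this problem at https://sparkbyexamples.com/pyspark/pyspark-add-new-column-to-dataframe/.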

Answer 1

Please always provide a toy example and the expected result. Here is mine:

import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

   period  temp2m
0       1      12
1       1      13
2       1      12
3       2      20
4       2      21
5       2      22
6       3      18
7       3      18
8       3      16

I believe you want:

for p in new_df['period'].unique():
    new_df[f'temp2m_p{p}'] = new_df['temp2m'].where(new_df['period'] == p)

This results in:

   period  temp2m  temp2m_p1  temp2m_p2  temp2m_p3
0       1      12       12.0        NaN        NaN
1       1      13       13.0        NaN        NaN
2       1      12       12.0        NaN        NaN
3       2      20        NaN       20.0        NaN
4       2      21        NaN       21.0        NaN
5       2      22        NaN       22.0        NaN
6       3      18        NaN        NaN       18.0
7       3      18        NaN        NaN       18.0
8       3      16        NaN        NaN       16.0
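The new columns come out as floats because Series.where fills the rows that do not match with NaN, and a plain integer column cannot hold NaN. If integer values matter, one possible variation (a sketch, assuming a pandas version that supports the nullable 'Int64' dtype; the name temp_nullable is only illustrative) is to cast before masking, so the masked rows become <NA> instead:

```python
import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

# Cast once to the nullable integer dtype so masked rows become <NA>
# rather than NaN, which would otherwise force the column to float64.
temp_nullable = new_df['temp2m'].astype('Int64')

for p in new_df['period'].unique():
    # Keep the temperature where the period matches, <NA> elsewhere.
    new_df[f'temp2m_p{p}'] = temp_nullable.where(new_df['period'] == p)

print(new_df.dtypes)
```

The original temp2m column is left untouched; only the derived columns use the nullable dtype.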

Edit, following the comments:

# One column per period, each re-indexed from 0 so the columns line up side by side
df_transformed = pd.concat(
    (new_df[new_df['period'] == p]['temp2m']
     .rename(f'temp2m_{p}')
     .reset_index(drop=True)
     for p in new_df['period'].unique()),
    axis=1,
)

This gives:

   temp2m_1  temp2m_2  temp2m_3
0        12        20        18
1        13        21        18
2        12        22        16
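A similar wide layout can probably also be reached with a pivot instead of concatenating one Series per period. The sketch below numbers the rows within each period using groupby(...).cumcount() and pivots on that counter; the helper column name row and the variable wide are illustrative only and not part of the original answer.

```python
import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

wide = (
    new_df
    # Position of each row inside its own period: 0, 1, 2, ...
    .assign(row=new_df.groupby('period').cumcount())
    # One column per period value, one row per within-period position.
    .pivot(index='row', columns='period', values='temp2m')
    # Column labels 1, 2, 3 become temp2m_1, temp2m_2, temp2m_3.
    .add_prefix('temp2m_')
)

print(wide)
```

If the periods had different numbers of rows, the shorter ones would be padded with NaN, just as with the concat version.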

Tags: python, dataframe, pandas, pyspark