How to create new column dynamically in pandas like we do in pyspark withColumn
from statistics import mean
import pandas as pd
df = pd.DataFrame(columns=['A', 'B', 'C'])
df["A"] = [1, 2, 3, 4, 4, 5, 6]
df["B"] = ["Feb", "Feb", "Feb", "May", "May", "May", "May"]
df["C"] = [10, 20, 30, 40, 30, 50, 60]
df1 = df.groupby(["A","B"]).agg(mean_err=("C", mean)).reset_index()
df1["threshold"] = df1["A"] * df1["mean_err"]
Instead of that last line, how can I do this the way PySpark's `.withColumn()` works? The code above isn't what I want. I want to create the new column dynamically from the output of an operation, as with PySpark's `withColumn` method. Does anyone know how to do this?
Option 1: DataFrame.eval
(df.groupby(['A', 'B'], as_index=False)
.agg(mean_err=('C', 'mean'))
.eval('threshold = A * mean_err'))
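A nicety of `eval` worth noting: besides column names, the expression string can reference local Python variables with the `@` prefix. A small sketch; the `scale` factor here is invented for illustration and is not part of the original question:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 4, 5, 6],
    "B": ["Feb", "Feb", "Feb", "May", "May", "May", "May"],
    "C": [10, 20, 30, 40, 30, 50, 60],
})

scale = 2  # hypothetical factor from the enclosing scope, referenced via @

out = (df.groupby(["A", "B"], as_index=False)
         .agg(mean_err=("C", "mean"))
         .eval("threshold = @scale * A * mean_err"))
print(out)
```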
Option 2: DataFrame.assign
(df.groupby(['A', 'B'], as_index=False)
.agg(mean_err=('C', 'mean'))
.assign(threshold=lambda x: x['A'] * x['mean_err']))
A B mean_err threshold
0 1 Feb 10.0 10.0
1 2 Feb 20.0 40.0
2 3 Feb 30.0 90.0
3 4 May 35.0 140.0
4 5 May 50.0 250.0
5 6 May 60.0 360.0
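The closest analog to chaining several `.withColumn()` calls is a single `.assign()` whose later keyword arguments reuse columns created by earlier ones (keyword order is preserved). A sketch; the `flagged` column and its cutoff of 100 are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 4, 5, 6],
    "B": ["Feb", "Feb", "Feb", "May", "May", "May", "May"],
    "C": [10, 20, 30, 40, 30, 50, 60],
})

result = (
    df.groupby(["A", "B"], as_index=False)
      .agg(mean_err=("C", "mean"))
      .assign(
          threshold=lambda x: x["A"] * x["mean_err"],
          # Later assignments can read columns created just above,
          # much like stacking .withColumn() calls in PySpark.
          flagged=lambda x: x["threshold"] > 100,
      )
)
print(result)
```

Because each `lambda` receives the intermediate DataFrame, this stays fully chainable inside a longer pipeline.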