How do I write equivalent PySpark code for the following statement?
X_train[var] = np.where(X_train[var].isin(frequent_ls), X_train[var], 'Rare')
How can I replace numpy here with PySpark SQL functions?
You can define a UDF:
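from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def dictionary(x):
    # Keep frequent values; label everything else 'Rare' (same label as the np.where call)
    if x in frequent_ls:
        return x
    else:
        return "Rare"

replace = F.udf(lambda x: dictionary(x), StringType())
# Assumes the column is literally named "var"; the result goes into a new column "var2"
X_train = X_train.withColumn("var2", replace(F.col("var")))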
You can simply use .isin together with F.when, with no UDF needed:
import pyspark.sql.functions as F

X_train = (X_train
    .withColumn(var, F.when(X_train[var].isin(frequent_ls), X_train[var]).otherwise('Rare')))