How to select rows and assign them new values with SparkR?

Question

In R, I can do the following:

x <- c(1, 8, 3, 5, 6)
y <- rep("Down",5)
y[x>5] <- "Up"

This results in the vector y being ("Down", "Up", "Down", "Down", "Up").

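Now my sequence x is the output of the predict function on a linear model fit. The predict function in R returns a sequence, while the predict function in Spark returns a DataFrame containing the columns of the test dataset plus the columns label and prediction.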

When I run

y[x$prediction > .5]

I get the error:

Error in y[x$prediction > 0.5] : invalid subscript type 'S4'

How can I solve this?

Answer 1

On selecting rows:

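Your approach will not work because x, as the output of Spark's predict, is a Spark (not R) DataFrame. The expression x$prediction > 0.5 is therefore a Column expression (an S4 object) rather than a logical vector, and it cannot be used to index the R vector y. To select rows, use SparkR's filter function instead. Here is a reproducible example using the iris dataset: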

library(SparkR)
sparkR.session()   # start a Spark session; not needed inside the sparkR shell
sparkR.version()
# "2.2.1"

df <- as.DataFrame(iris)
df
# SparkDataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
nrow(df)
# 150

# Let's keep only the records with Petal_Width > 0.2:
df2 <- filter(df, df$Petal_Width > 0.2)    
nrow(df2)
# 116

Check also the examples in the docs.
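For reference, the same row selection can be written in a few equivalent ways. The following is a minimal sketch that assumes a working Spark installation and rebuilds df from iris as above; the names by_where, by_sql and by_index are illustrative placeholders, not part of the original answer.

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(iris)

# where() is an alias of filter()
by_where <- where(df, df$Petal_Width > 0.2)

# filter() also accepts a SQL expression given as a string
by_sql <- filter(df, "Petal_Width > 0.2")

# R-style row subsetting with a Column condition
by_index <- df[df$Petal_Width > 0.2, ]

# Each of these should report the same row count as nrow(df2) above
nrow(by_where)
nrow(by_sql)
nrow(by_index)
```

All three return a new SparkDataFrame; none of them modifies df.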

On replacing row values:

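The standard practice for replacing row values in Spark dataframes is to first create a new column with the required condition, and then possibly drop the old column. Here is an example in which we replace the values of Petal_Width greater than 0.2 with 0 in the df defined above: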

newDF <- withColumn(df, "new_PetalWidth", ifelse(df$Petal_Width > 0.2, 0, df$Petal_Width))
head(newDF)
# result:
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species new_PetalWidth
1          5.1         3.5          1.4         0.2  setosa            0.2
2          4.9         3.0          1.4         0.2  setosa            0.2
3          4.7         3.2          1.3         0.2  setosa            0.2
4          4.6         3.1          1.5         0.2  setosa            0.2
5          5.0         3.6          1.4         0.2  setosa            0.2
6          5.4         3.9          1.7         0.4  setosa            0.0 # <- value changed

# drop the old column:
newDF <- drop(newDF, "Petal_Width")
head(newDF)
# result:
  Sepal_Length Sepal_Width Petal_Length Species new_PetalWidth
1          5.1         3.5          1.4  setosa            0.2
2          4.9         3.0          1.4  setosa            0.2
3          4.7         3.2          1.3  setosa            0.2
4          4.6         3.1          1.5  setosa            0.2
5          5.0         3.6          1.4  setosa            0.2
6          5.4         3.9          1.7  setosa            0.0
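If the intermediate column is not needed, the add-then-drop steps can likely be collapsed into one. The sketch below assumes that withColumn replaces an existing column when it is given a name that already exists, as the SparkR 2.x documentation describes; replacedDF and renamedDF are illustrative names of my own.

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(iris)

# Reusing the existing column name overwrites that column instead of adding one
replacedDF <- withColumn(df, "Petal_Width",
                         ifelse(df$Petal_Width > 0.2, 0, df$Petal_Width))
columns(replacedDF)

# Alternative: keep the two-step approach, then restore the original name
newDF <- withColumn(df, "new_PetalWidth",
                    ifelse(df$Petal_Width > 0.2, 0, df$Petal_Width))
renamedDF <- withColumnRenamed(drop(newDF, "Petal_Width"),
                               "new_PetalWidth", "Petal_Width")
columns(renamedDF)
```

In the second variant the renamed column ends up last, since it was appended before the old one was dropped.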

The method also works across different columns. Here is an example of a new column that takes the value 0 or Petal_Width, depending on a condition on Petal_Length:

newDF2 <- withColumn(df, "something_here", ifelse(df$Petal_Length > 1.4, 0, df$Petal_Width))
head(newDF2)
# result:
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species something_here
1          5.1         3.5          1.4         0.2  setosa            0.2
2          4.9         3.0          1.4         0.2  setosa            0.2
3          4.7         3.2          1.3         0.2  setosa            0.2
4          4.6         3.1          1.5         0.2  setosa            0.0
5          5.0         3.6          1.4         0.2  setosa            0.2
6          5.4         3.9          1.7         0.4  setosa            0.0
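Applied to the situation in the question, the same pattern might look as follows. This is only a sketch under stated assumptions: the spark.glm model on iris is a stand-in for the asker's linear model, the threshold value is arbitrary, and the names pred, labeled and direction are hypothetical.

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(iris)

# Stand-in model, fitted only to obtain a `prediction` column
model <- spark.glm(df, Sepal_Length ~ Petal_Length, family = "gaussian")
pred <- predict(model, df)   # SparkDataFrame with an added `prediction` column

# Arbitrary cut-off chosen for illustration
threshold <- 5.8

# Counterpart of y <- rep("Down", n); y[x > threshold] <- "Up"
labeled <- withColumn(pred, "direction",
                      ifelse(pred$prediction > threshold, "Up", "Down"))
head(select(labeled, "Sepal_Length", "prediction", "direction"))

# If a plain R vector is really needed, bring the result to the driver
y <- collect(select(labeled, "direction"))$direction
```

Note that collect pulls all rows into local R memory, so it is only appropriate when the result is small enough to fit on the driver.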

