Drop a DataFrame's Column in SparkR

Question

I'm wondering whether there is a concise way to drop a DataFrame's column in SparkR, like df.drop("column_name") in pyspark.

This is the closest I could get:

df <- new("DataFrame", sdf=SparkR:::callJMethod(df@sdf, "drop", "column_name"), isCached=FALSE)

Answer 1

Spark >= 2.0.0

You can use the drop function:

drop(df, "column_name")
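As a rough end-to-end sketch of the call above: the data set, column names, and local session settings below are illustrative assumptions, not part of the original answer. It assumes a local Spark 2.x installation that SparkR can start, and relies on drop accepting either a single name or a character vector of names.

```r
library(SparkR)

# Assumption: a local Spark 2.x installation reachable by SparkR.
sparkR.session(master = "local[1]", appName = "drop-columns-sketch")

# Small illustrative data set (hypothetical columns).
local_df <- data.frame(
  name = c("Ann", "Bob"),
  age  = c(30, 25),
  city = c("Oslo", "Lima"),
  stringsAsFactors = FALSE
)
df <- createDataFrame(local_df)

# Drop one column by name.
without_age <- drop(df, "age")

# Drop several columns by passing a character vector.
only_name <- drop(df, c("age", "city"))

print(columns(without_age))
print(columns(only_name))
```

drop returns a new DataFrame and leaves df itself unchanged, so the result has to be assigned to keep it.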

Spark < 2.0.0

You can use the select function to keep only the columns you need, passing it a set of column names or Column expressions.

Usage:

## S4 method for signature 'DataFrame'
x$name
## S4 replacement method for signature 'DataFrame'
x$name <- value
## S4 method for signature 'DataFrame,character'
select(x, col, ...)
## S4 method for signature 'DataFrame,Column'
select(x, col, ...)
## S4 method for signature 'DataFrame,list'
select(x, col)
select(x, col, ...)
selectExpr(x, expr, ...)

Examples:

select(df, "*")
select(df, "col1", "col2")
select(df, df$name, df$age + 1)
select(df, c("col1", "col2"))
select(df, list(df$name, df$age + 1))

# Similar to R data frames columns can also be selected using `$`
df$age
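Since select keeps columns rather than removing them, dropping a column on Spark < 2.0.0 comes down to selecting every name except the unwanted ones. A minimal sketch of that name arithmetic in plain R follows; keep_columns is a hypothetical helper introduced here for illustration, and the commented SparkR call assumes an existing DataFrame named df.

```r
# Hypothetical helper: the column names that remain after removing `drop_cols`.
keep_columns <- function(all_cols, drop_cols) {
  # setdiff() returns the elements of its first argument that are not in
  # the second, in their original order.
  setdiff(all_cols, drop_cols)
}

kept <- keep_columns(c("name", "age", "city"), "age")
print(kept)

# With a SparkR DataFrame `df`, the same idea would look like:
# df <- select(df, keep_columns(columns(df), "age"))
```

Names in drop_cols that do not exist in the DataFrame are silently ignored by this approach.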

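You might also be interested in the subset function, which returns a subset of a DataFrame according to given conditions.

See the official SparkR documentation for more information and examples.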

Answer 2

This can be achieved by assigning NULL to the Spark DataFrame column:

df$column_name <- NULL
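A short sketch of how this behaves in practice: the data and the session call below are illustrative assumptions. It uses the Spark 2.x sparkR.session entry point, so an older release would need its own context setup instead.

```r
library(SparkR)

# Assumption: a local Spark 2.x installation reachable by SparkR.
sparkR.session(master = "local[1]", appName = "null-assign-sketch")

# Small illustrative data set (hypothetical columns).
df <- createDataFrame(data.frame(
  name = c("Ann", "Bob"),
  age  = c(30, 25),
  stringsAsFactors = FALSE
))

# Assigning NULL removes the column from the DataFrame bound to `df`.
df$age <- NULL

print(columns(df))
```

Unlike drop or select, this form rebinds the variable df in place, which is convenient interactively but means the original set of columns is no longer reachable through that name.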

See the original discussion in the related Spark JIRA ticket.

Answer 3

Using select:

library(magrittr)  # provides the %>% pipe used below

drop_columns <- function(df, cols) {
  # Names of all columns
  col_names <- df %>% colnames
  # Filter out the column names passed in
  col_names <- col_names[!(col_names %in% cols)]
  # Select the remaining columns
  df %>% select(col_names)
}

df %>% drop_columns(c('column1', 'column2'))
