How to check for intersection of two DataFrame columns in Spark
Using pyspark or sparkr (preferably both), how can I get the intersection of two DataFrame columns? For example, in sparkr I have the following DataFrames:
newHires <- data.frame(name = c("Thomas", "George", "George", "John"),
                       surname = c("Smith", "Williams", "Brown", "Taylor"))
salesTeam <- data.frame(name = c("Lucas", "Bill", "George"),
                        surname = c("Martin", "Clark", "Williams"))
newHiresDF <- createDataFrame(newHires)
salesTeamDF <- createDataFrame(salesTeam)
#Intersect works for the entire DataFrames
newSalesHire <- intersect(newHiresDF, salesTeamDF)
head(newSalesHire)
name surname
1 George Williams
#Intersect does not work for single columns
newSalesHire <- intersect(newHiresDF$name, salesTeamDF$name)
head(newSalesHire)
Error in as.vector(y) : no method for coercing this S4 class to a vector
How can I get intersect to work for single columns?
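You need two Spark DataFrames to use the intersect function; newHiresDF$name is a Column, not a DataFrame, which is why the call in the question fails. Use the select function to get a specific column from each DataFrame as a one-column DataFrame.
In SparkR:
newSalesHire <- intersect(select(newHiresDF, 'name'), select(salesTeamDF, 'name'))
In pyspark:
newSalesHire = newHiresDF.select('name').intersect(salesTeamDF.select('name'))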