Subsetting SparkR DataFrame based on column values matching another DataFrame's column values
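I have two SparkR DataFrames, newHiresDF and salesTeamDF. I want to get the subset of newHiresDF containing only the rows whose newHiresDF$name value appears in salesTeamDF$name, but I can't figure out a way to do this. Below is the code I tried.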
#Create DataFrames
newHires <- data.frame(name = c("Thomas", "George", "Bill", "John"),
                       surname = c("Smith", "Williams", "Brown", "Taylor"))
salesTeam <- data.frame(name = c("Thomas", "Bill", "George"),
                        surname = c("Martin", "Clark", "Williams"))
newHiresDF <- createDataFrame(newHires)
salesTeamDF <- createDataFrame(salesTeam)
display(newHiresDF)
#Try to subset newHiresDF based on name values in salesTeamDF
#All of the below result in errors
NHsubset1 <- filter(newHiresDF, newHiresDF$name %in% salesTeamDF$name)
NHsubset2 <- filter(newHiresDF, intersect(select(newHiresDF, 'name'),
                                          select(salesTeamDF, 'name')))
NHsubset3 <- newHiresDF[newHiresDF$name %in% salesTeamDF$name,] #This is how it would be done in R
#What I'd like NHsubset to look like:
#     name  surname
# 1 Thomas    Smith
# 2 George Williams
# 3   Bill    Brown
PySpark code would also be fine, if you prefer.
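I came up with a solution that seems simple in hindsight: just use merge. Selecting only name from salesTeamDF keeps the join on that one shared column, so rows are matched by name alone.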
NHsubset <- merge(newHiresDF, select(salesTeamDF, 'name'))
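As an alternative to merge, a left semi join may express the intent more directly: it keeps the rows of newHiresDF that have a match in salesTeamDF and returns only newHiresDF's columns. The following is a minimal sketch, not the original poster's method. It assumes Spark 2.x or later, where sparkR.session() starts a session (on Databricks one typically already exists), and the variable name result is just for illustration.

```r
library(SparkR)
sparkR.session()  # assumption: Spark 2.x+; skip if a session is already running

# Same sample data as in the question
newHiresDF <- createDataFrame(data.frame(
  name = c("Thomas", "George", "Bill", "John"),
  surname = c("Smith", "Williams", "Brown", "Taylor"),
  stringsAsFactors = FALSE))
salesTeamDF <- createDataFrame(data.frame(
  name = c("Thomas", "Bill", "George"),
  surname = c("Martin", "Clark", "Williams"),
  stringsAsFactors = FALSE))

# Left semi join: keep rows of newHiresDF whose name appears in salesTeamDF.
# Only the columns of newHiresDF are returned; row order is not guaranteed.
NHsubset <- join(newHiresDF, salesTeamDF,
                 newHiresDF$name == salesTeamDF$name, "leftsemi")

# Bring the (small) result back to a local R data.frame for inspection
result <- collect(NHsubset)
print(result)
```

In PySpark the same idea would be newHiresDF.join(salesTeamDF, on="name", how="left_semi").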