SparkR:如何提取特定列中包含空值的行
SparkR: How to extract rows that contain null values in a specific column
免责声明:我几乎没有使用 SparkR 的经验
采用以下数据框:
ID Date1 Date2
58844880 04/11/16 NaN
59745846 04/12/16 04/14/16
59743311 04/13/16 NaN
59745848 04/14/16 04/11/16
59598413 NaN NaN
59745921 04/14/16 04/14/16
59561199 04/15/16 04/15/16
NaN 04/16/16 04/16/16
59561198 NaN 04/17/16
我只想获取日期 2 列中具有 NaN
的行。
在 R 中,我会使用 new_DF <- DF[is.na(DF$Var),]
但我不确定在 SparkR 中该做什么
可以将 filter
与条件 isNull
一起使用,例如:
DF2 <- SparkR::filter(DF, isNull(DF$Date2))
这是一个使用 sparklyr
的选项
library(sparklyr)
library(dplyr)
con <- spark_connect(master = "local")
DF1 = copy_to(con, DF)
DF1 %>%
mutate_at(vars(matches("Date")),
funs(to_date(from_unixtime(unix_timestamp(., "MM/dd/yy"))))) %>%
filter(is.na(Date2)) %>%
collect()
# A tibble: 3 x 3
# ID Date1 Date2
# <dbl> <date> <date>
#1 58844880 2016-04-10 NA
#2 59743311 2016-04-12 NA
#3 59598413 NA NA
spark_disconnect(con)
数据
DF <- structure(list(ID = c(58844880, 59745846, 59743311, 59745848,
59598413, 59745921, 59561199, NaN, 59561198), Date1 = c("04/11/16",
"04/12/16", "04/13/16", "04/14/16", "NaN", "04/14/16", "04/15/16",
"04/16/16", "NaN"), Date2 = c("NaN", "04/14/16", "NaN", "04/11/16",
"NaN", "04/14/16", "04/15/16", "04/16/16", "04/17/16")), .Names = c("ID",
"Date1", "Date2"), class = "data.frame", row.names = c(NA, -9L
))
免责声明:我几乎没有使用 SparkR 的经验
采用以下数据框:
ID Date1 Date2
58844880 04/11/16 NaN
59745846 04/12/16 04/14/16
59743311 04/13/16 NaN
59745848 04/14/16 04/11/16
59598413 NaN NaN
59745921 04/14/16 04/14/16
59561199 04/15/16 04/15/16
NaN 04/16/16 04/16/16
59561198 NaN 04/17/16
我只想获取日期 2 列中具有 NaN
的行。
在 R 中,我会使用 new_DF <- DF[is.na(DF$Var),]
但我不确定在 SparkR 中该做什么
可以将 filter
与条件 isNull
一起使用,例如:
DF2 <- SparkR::filter(DF, isNull(DF$Date2))
这是一个使用 sparklyr
library(sparklyr)
library(dplyr)
con <- spark_connect(master = "local")
DF1 = copy_to(con, DF)
DF1 %>%
mutate_at(vars(matches("Date")),
funs(to_date(from_unixtime(unix_timestamp(., "MM/dd/yy"))))) %>%
filter(is.na(Date2)) %>%
collect()
# A tibble: 3 x 3
# ID Date1 Date2
# <dbl> <date> <date>
#1 58844880 2016-04-10 NA
#2 59743311 2016-04-12 NA
#3 59598413 NA NA
spark_disconnect(con)
数据
DF <- structure(list(ID = c(58844880, 59745846, 59743311, 59745848,
59598413, 59745921, 59561199, NaN, 59561198), Date1 = c("04/11/16",
"04/12/16", "04/13/16", "04/14/16", "NaN", "04/14/16", "04/15/16",
"04/16/16", "NaN"), Date2 = c("NaN", "04/14/16", "NaN", "04/11/16",
"NaN", "04/14/16", "04/15/16", "04/16/16", "04/17/16")), .Names = c("ID",
"Date1", "Date2"), class = "data.frame", row.names = c(NA, -9L
))