How to extract the column names that don't have any null values in a dataset in R/sparklyr?
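I want to extract only the names of the columns that have no null values in a large dataset in R.
Suppose my table has 4 columns (id, Price, Product, Status), where the Price and Status columns contain some null values while id and Product contain none. Then I want my output to be: id, Product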
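data <- data.frame(ID = c(1,2,3,4),
                   Price = c(50, NA, 10, 20),
                   Product = c("A", "B", "C", "D"),
                   Status = c("Complete", NA, "Complete", "Incomplete"))
# Keep the names of the columns for which anyNA() is FALSE.
# sapply() checks each column directly instead of coercing the data frame to a matrix.
names(data)[!sapply(data, anyNA)]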
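For a local data frame, a variant of the same idea counts the missing values per column and keeps the names whose count is zero. This is a minimal base R sketch; `na_counts` and `complete_cols` are names introduced here for illustration, and it does not apply directly to a Spark table.

```r
# Same example data as in the snippet above.
data <- data.frame(
  ID = c(1, 2, 3, 4),
  Price = c(50, NA, 10, 20),
  Product = c("A", "B", "C", "D"),
  Status = c("Complete", NA, "Complete", "Incomplete")
)

# is.na() on a data frame returns a logical matrix with the column names kept,
# so colSums() gives a named vector of missing-value counts.
na_counts <- colSums(is.na(data))

# Keep only the names of the columns with zero missing values.
complete_cols <- names(na_counts)[na_counts == 0]
complete_cols
```

Keeping the counts around (rather than only a TRUE/FALSE flag) is handy if you later want to relax the rule, for example to allow a small number of missing values.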
If you need an exact answer, you have to scan the whole dataset first to count the missing values:
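library(sparklyr)
library(dplyr)

# Assumes `sc` is an existing Spark connection, e.g. sc <- spark_connect(master = "local")
df <- copy_to(sc, tibble(
  id = 1:4, Price = c(NA, 3.20, NA, 42),
  Product = c("p1", "p2", "p3", "p4"),
  Status = c(NA, "foo", "bar", NA)))

# Count the missing values of every column in Spark, then collect the one-row result.
# Note: funs() is deprecated as of dplyr 0.8.0 and emits a warning in newer versions.
null_counts <- df %>%
  summarise_all(funs(sum(as.numeric(is.na(.)), na.rm=TRUE))) %>%
  collect()

null_counts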
# A tibble: 1 x 4
id Price Product Status
<dbl> <dbl> <dbl> <dbl>
1 0 2 0 2
Determine which columns have a missing count of zero:
cols_without_nulls <- null_counts %>%
  select_if(funs(. == 0)) %>%
  colnames()

cols_without_nulls
[1] "id" "Product"
and use these in select:
df %>% select(one_of(cols_without_nulls))
# Source: spark<?> [?? x 2]
id Product
<int> <chr>
1 1 p1
2 2 p2
3 3 p3
4 4 p4
A shorter variant exists:
df %>% select_if(funs(sum(as.numeric(is.na(.)), na.rm=TRUE) == 0))
Applying predicate on the first 100 rows
# Source: spark<?> [?? x 2]
id Product
<int> <chr>
1 1 p1
2 2 p2
3 3 p3
4 4 p4
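But as the message in the output shows, it applies the predicate to only the first 100 rows, so it samples the data instead of scanning all of it.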
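To see why a 100-row sample can give a wrong answer, here is a small local illustration in base R. It is only a sketch of the pitfall, not of sparklyr's internals: `local_df` and its `late_na` column are hypothetical, and `head(local_df, 100)` stands in for "the first 100 rows".

```r
# Hypothetical data: `late_na` has its only missing value in row 101,
# so its first 100 rows look complete.
n <- 150
local_df <- data.frame(
  id = seq_len(n),
  late_na = c(rep(1, 100), NA, rep(1, n - 101))
)

# Predicate evaluated on the first 100 rows only.
sampled <- names(local_df)[!sapply(head(local_df, 100), anyNA)]

# Predicate evaluated on every row.
full <- names(local_df)[!sapply(local_df, anyNA)]

sampled  # includes late_na, which is wrong for the full data
full     # only id
```

When the result has to be exact, the full-scan approach with the collected null counts is the safer choice.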