Subsetting a data frame in R by retaining first occurrences of unique elements from all columns
df is a test data frame with 5 rows and 6 columns. It is a subset of a much larger data frame (dimensions: 1,000,000 × 30).
df <- data.frame(
  Hits = c("Hit1", "Hit2", "Hit3", "Hit4", "Hit5"),
  category1 = c("a", "", "b", "a", ""),
  category2 = c("c", "", "", "d", "c"),
  category3 = c("", "", "e", "f", "f"),
  category4 = c("", "", "", "", ""),
  category5 = c("i", "", "i", "j", "")
)
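For each column from category1 to category5, I need to keep only the first occurrence of every unique element. For example, in category1 the unique elements are a and b, and their first occurrences are in rows 1 and 3 respectively, so rows 1 and 3 should be kept, and so on for the other columns. The expected output contains only those rows.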
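The rule in the question can be checked directly in base R. The following is only a sketch of how to list the rows that hold a first occurrence; the names first_rows and rows_to_keep are made up for illustration, and empty strings are assumed to mean "no value".

```r
# Same test data as in the question.
df <- data.frame(
  Hits = c("Hit1", "Hit2", "Hit3", "Hit4", "Hit5"),
  category1 = c("a", "", "b", "a", ""),
  category2 = c("c", "", "", "d", "c"),
  category3 = c("", "", "e", "f", "f"),
  category4 = c("", "", "", "", ""),
  category5 = c("i", "", "i", "j", "")
)

# For every category column, the row indices where a non-empty value
# appears for the first time (hypothetical helper name).
first_rows <- lapply(df[-1], function(x) which(!duplicated(x) & x != ""))

# Union of those indices across all columns: the rows that must be kept.
rows_to_keep <- sort(unique(unlist(first_rows)))

print(first_rows$category1)  # 1 3
print(rows_to_keep)          # 1 3 4
```

For the test data this selects rows 1, 3 and 4, which are the rows Hit1, Hit3 and Hit4.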
Using lapply and duplicated, you can first replace the duplicates in each column with "" and then keep only the rows that contain at least one non-"" string:
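df[-1] <- lapply(df[-1], function(x) {
  # blank out every repeat of a value already seen higher up in the column
  x[duplicated(x)] <- ""
  x
})
# keep the rows that still hold at least one non-empty category value
df <- df[rowSums(df[-1] != "") > 0, ]
df
#>   Hits category1 category2 category3 category4 category5
#> 1 Hit1         a         c                             i
#> 3 Hit3         b                   e                    
#> 4 Hit4                   d         f                   j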
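For the full 1,000,000 × 30 data frame it may be more convenient to wrap the same two steps in a function that selects the value columns by name instead of by position. This is a sketch only: keep_first_occurrences and its id_cols argument are hypothetical names, and it assumes the category columns are character vectors (the data.frame() default since R 4.0.0), not factors.

```r
# Same test data as in the question.
df <- data.frame(
  Hits = c("Hit1", "Hit2", "Hit3", "Hit4", "Hit5"),
  category1 = c("a", "", "b", "a", ""),
  category2 = c("c", "", "", "d", "c"),
  category3 = c("", "", "e", "f", "f"),
  category4 = c("", "", "", "", ""),
  category5 = c("i", "", "i", "j", "")
)

# Hypothetical helper: blank out repeated values per column, then drop
# the rows that are left with no value at all.
keep_first_occurrences <- function(data, id_cols = "Hits") {
  value_cols <- setdiff(names(data), id_cols)
  data[value_cols] <- lapply(data[value_cols], function(x) {
    x[duplicated(x)] <- ""   # assumes character columns
    x
  })
  keep <- rowSums(data[value_cols] != "") > 0
  data[keep, , drop = FALSE]
}

res <- keep_first_occurrences(df)
print(res$Hits)  # "Hit1" "Hit3" "Hit4"
```

Because the function works on a copy, the original df is left unchanged, and the original row names (1, 3, 4) are preserved in the result.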
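I was asked to find a tidyverse solution and ended up with this. I offer it as a learning exercise rather than as a recommended solution.
The basic idea is to remove the duplicates in long format and then pivot back to wide format, but this "simple" idea turned out to be quite complicated, as you can see here: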
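library(tidyverse)
df %>%
  # long format: one row per (column name, value) pair
  pivot_longer(everything()) %>%
  # treat empty strings as missing values
  mutate(value = na_if(value, "")) %>%
  # drop repeated (name, value) pairs, keeping the first occurrence
  unique() %>%
  # every "Hits" entry starts a new group, one per original row
  group_by(id = cumsum(name == "Hits")) %>%
  # per-group counter so that pivot_wider() gets unique keys
  mutate(row = row_number()) %>%
  # back to wide format (names from `name`, values from `value`)
  pivot_wider() %>%
  # spread each value over all rows of its group
  fill(everything(), .direction = "updown") %>%
  # drop rows whose category columns are all NA
  filter(if_any(category1:category5, ~ !is.na(.))) %>%
  # collapse each group to a single row
  slice(1) %>%
  ungroup() %>%
  select(-c(id, row)) %>%
  # turn NA back into ""
  mutate(across(everything(), ~ replace_na(., "")))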
Hits category1 category2 category3 category4 category5
<chr> <chr> <chr> <chr> <chr> <chr>
1 Hit1 "a" "c" "" "" "i"
2 Hit3 "b" "" "e" "" ""
3 Hit4 "" "d" "f" "" "j"
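The long-format idea can probably be written more compactly with dplyr::distinct(). The sketch below is not the answerer's code; it assumes that Hits identifies each row and that "" marks an empty cell.

```r
library(dplyr)
library(tidyr)

# Same test data as in the question.
df <- data.frame(
  Hits = c("Hit1", "Hit2", "Hit3", "Hit4", "Hit5"),
  category1 = c("a", "", "b", "a", ""),
  category2 = c("c", "", "", "d", "c"),
  category3 = c("", "", "e", "f", "f"),
  category4 = c("", "", "", "", ""),
  category5 = c("i", "", "i", "j", "")
)

res <- df %>%
  # long format with columns Hits, name, value
  pivot_longer(-Hits) %>%
  # ignore empty cells
  filter(value != "") %>%
  # keep the first row for every (column name, value) pair
  distinct(name, value, .keep_all = TRUE) %>%
  # back to wide format, filling the gaps with ""
  pivot_wider(names_from = name, values_from = value, values_fill = "")

print(res$Hits)  # "Hit1" "Hit3" "Hit4"
```

One difference from the other answers: a column with no non-empty value at all (category4 here) does not come back after pivot_wider(), and the remaining columns appear in the order in which they first occur in the long data, so they may need to be re-added or reordered afterwards.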
Another possible solution, based on dplyr and purrr::map_dfc:
library(tidyverse)
df <- data.frame(
  Hits = c("Hit1", "Hit2", "Hit3", "Hit4", "Hit5"),
  category1 = c("a", "", "b", "a", ""),
  category2 = c("c", "", "", "d", "c"),
  category3 = c("", "", "e", "f", "f"),
  category4 = c("", "", "", "", ""),
  category5 = c("i", "", "i", "j", ""))
df %>%
  map_dfc(~ if_else(duplicated(.x), "", .x)) %>%
  filter(rowSums(. == "") != 5)
#> # A tibble: 3 × 6
#> Hits category1 category2 category3 category4 category5
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Hit1 "a" "c" "" "" "i"
#> 2 Hit3 "b" "" "e" "" ""
#> 3 Hit4 "" "d" "f" "" "j"
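The filter above hard-codes the number of category columns (5), which would have to change for the 30-column data. A variant that avoids the constant might look like the sketch below, using across() and if_any() from dplyr; it assumes Hits is the only identifier column.

```r
library(dplyr)

# Same test data as in the question.
df <- data.frame(
  Hits = c("Hit1", "Hit2", "Hit3", "Hit4", "Hit5"),
  category1 = c("a", "", "b", "a", ""),
  category2 = c("c", "", "", "d", "c"),
  category3 = c("", "", "e", "f", "f"),
  category4 = c("", "", "", "", ""),
  category5 = c("i", "", "i", "j", "")
)

res <- df %>%
  # blank out repeated values in every column except Hits
  mutate(across(-Hits, ~ replace(.x, duplicated(.x), ""))) %>%
  # keep rows where at least one of those columns is still non-empty
  filter(if_any(-Hits, ~ .x != ""))

print(res$Hits)  # "Hit1" "Hit3" "Hit4"
```

The column selection -Hits adapts to however many category columns the data frame has.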
A data.table version, following @PaulS's tidyverse approach. It may help with speed on 1,000,000 rows or so.
library(data.table)
setDT(df)
df = cbind(df[,1], df[,-1][, lapply(.SD, \(x) fifelse(duplicated(x),"",x))])
df[rowSums(df[,-1]=="")<5]
Output
   Hits category1 category2 category3 category4 category5
1: Hit1         a         c                             i
2: Hit3         b                   e                    
3: Hit4                   d         f                   j
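The same data.table idea can also be sketched with .SDcols and := so that neither the column positions nor the column count are hard-coded. The variable names dt and value_cols are illustrative, and whether this is actually faster on the full data has not been measured here.

```r
library(data.table)

# Same test data as in the question.
df <- data.frame(
  Hits = c("Hit1", "Hit2", "Hit3", "Hit4", "Hit5"),
  category1 = c("a", "", "b", "a", ""),
  category2 = c("c", "", "", "d", "c"),
  category3 = c("", "", "e", "f", "f"),
  category4 = c("", "", "", "", ""),
  category5 = c("i", "", "i", "j", "")
)

# as.data.table() makes a copy, so df itself is not modified.
dt <- as.data.table(df)
value_cols <- setdiff(names(dt), "Hits")

# Blank out repeated values in the category columns, updating dt by reference.
dt[, (value_cols) := lapply(.SD, function(x) fifelse(duplicated(x), "", x)),
   .SDcols = value_cols]

# Keep the rows that still contain at least one non-empty category value.
keep <- rowSums(as.matrix(dt[, ..value_cols]) != "") > 0
res <- dt[keep]

print(res$Hits)  # "Hit1" "Hit3" "Hit4"
```

Using function(x) instead of the \(x) shorthand keeps the sketch usable on R versions older than 4.1.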