Efficient way of creating a new variable via fuzzy string matching and grouped summarization in R
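I am trying to use fuzzy string matching to convert strings to specific IDs and then perform a grouped summarization with dplyr. The basic idea is to use a dictionary lookup to collapse imperfect gene sequences into a single gene name and count how many times that gene was detected. This way, the counts for the sequences aaaaaa and aaaxaa are matched to gene1 and added together.
I can do what I want with for and if statements by comparing the raw data to the dictionary row by row, but I expect this to become inefficient as I scale up (the raw data files average 15k rows and the dictionary has 200 rows). Please see below the solution I am trying to improve on, and let me know if you can think of a more efficient and elegant way to accomplish this.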
df <- data.frame(str_var = rep(c("aaaaaa", "aXaaaa", "bbbbbb", "bbbXbb"), 3),
                 grp_var = rep(c("grp1", "grp2"), each = 6),
                 num_var = rep(c(1, 2), 6))
df
#> str_var grp_var num_var
#> 1 aaaaaa grp1 1
#> 2 aXaaaa grp1 2
#> 3 bbbbbb grp1 1
#> 4 bbbXbb grp1 2
#> 5 aaaaaa grp1 1
#> 6 aXaaaa grp1 2
#> 7 bbbbbb grp2 1
#> 8 bbbXbb grp2 2
#> 9 aaaaaa grp2 1
#> 10 aXaaaa grp2 2
#> 11 bbbbbb grp2 1
#> 12 bbbXbb grp2 2
dictionary <- data.frame(string = c("aaaaaa", "bbbbbb", "cccccc", "dddddd"),
                         id = c("gene1", "gene2", "gene3", "gene4"))
dictionary
#> string id
#> 1 aaaaaa gene1
#> 2 bbbbbb gene2
#> 3 cccccc gene3
#> 4 dddddd gene4
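# Start with NA so that rows without any dictionary match stay NA
df$gene_id <- NA_character_
for (i in seq_len(nrow(df))) {
  for (j in seq_len(nrow(dictionary))) {
    # Intended to allow at most one substitution, no insertions or deletions
    match_found <- agrepl(dictionary$string[j], df$str_var[i],
                          max.distance = list(sub = 1, ins = 0, del = 0, all = 1 - 1e-9))
    if (match_found) {
      df$gene_id[i] <- dictionary[j, "id"]
      # Stop at the first dictionary entry that matches
      break
    }
  }
}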
suppressPackageStartupMessages(library(dplyr))
new_df <- df %>%
  group_by(grp_var, gene_id) %>%
  summarize(gene_count = sum(num_var))
#> `summarise()` has grouped output by 'grp_var'. You can override using the `.groups` argument.
new_df
#> # A tibble: 4 x 3
#> # Groups: grp_var [2]
#> grp_var gene_id gene_count
#> <chr> <chr> <dbl>
#> 1 grp1 gene1 6
#> 2 grp1 gene2 3
#> 3 grp2 gene1 3
#> 4 grp2 gene2 6
Created on 2021-06-08 by the reprex package (v2.0.0)
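For comparison, the inner loop over rows is not strictly needed, because agrepl() is vectorized over its second argument. The following is a minimal base-R sketch along those lines, not the original code: it loops only over the dictionary entries and fills gene_id for rows that are still unmatched, which keeps the "first dictionary match wins" behaviour of the break in the loop above. The names hit, unset and gene_counts are illustrative, and stringsAsFactors = FALSE is added only so the sketch behaves the same on older R versions.

```r
# Example data copied from the question
df <- data.frame(str_var = rep(c("aaaaaa", "aXaaaa", "bbbbbb", "bbbXbb"), 3),
                 grp_var = rep(c("grp1", "grp2"), each = 6),
                 num_var = rep(c(1, 2), 6),
                 stringsAsFactors = FALSE)
dictionary <- data.frame(string = c("aaaaaa", "bbbbbb", "cccccc", "dddddd"),
                         id = c("gene1", "gene2", "gene3", "gene4"),
                         stringsAsFactors = FALSE)

# Rows with no dictionary match keep NA
df$gene_id <- NA_character_
for (j in seq_len(nrow(dictionary))) {
  # agrepl() tests one dictionary pattern against every row of str_var at once
  # (same max.distance settings as the loop in the question)
  hit <- agrepl(dictionary$string[j], df$str_var,
                max.distance = list(sub = 1, ins = 0, del = 0, all = 1 - 1e-9))
  # Only fill rows not yet matched, so the earliest dictionary entry wins
  unset <- hit & is.na(df$gene_id)
  df$gene_id[unset] <- dictionary$id[j]
}

# Grouped sum in base R (the question uses dplyr for this step)
gene_counts <- aggregate(num_var ~ grp_var + gene_id, data = df, FUN = sum)
```

With a 200-row dictionary this makes 200 calls to agrepl(), one per dictionary entry, instead of one call per row and dictionary entry pair.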
Perhaps fuzzyjoin would be easier:
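library(dplyr)
library(fuzzyjoin)
# Fuzzy-join each sequence to the dictionary, then sum the counts per group and gene
stringdist_left_join(df, dictionary, by = c("str_var" = "string")) %>%
  group_by(grp_var, gene_id = id) %>%
  summarise(gene_count = sum(num_var), .groups = 'drop')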
-output
# A tibble: 4 x 3
grp_var gene_id gene_count
<chr> <chr> <dbl>
1 grp1 gene1 6
2 grp1 gene2 3
3 grp2 gene1 3
4 grp2 gene2 6
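One thing to keep in mind: as far as I know, stringdist_left_join() defaults to method = "osa" with max_dist = 2, which is more permissive than the one-substitution rule used in the question. Below is a hedged sketch that tightens the match. It assumes all sequences and dictionary strings have the same length, so that Hamming distance (which counts substitutions only) is an appropriate stand-in for the original agrepl() settings. The name res is illustrative.

```r
library(dplyr)
library(fuzzyjoin)

# Example data copied from the question
df <- data.frame(str_var = rep(c("aaaaaa", "aXaaaa", "bbbbbb", "bbbXbb"), 3),
                 grp_var = rep(c("grp1", "grp2"), each = 6),
                 num_var = rep(c(1, 2), 6),
                 stringsAsFactors = FALSE)
dictionary <- data.frame(string = c("aaaaaa", "bbbbbb", "cccccc", "dddddd"),
                         id = c("gene1", "gene2", "gene3", "gene4"),
                         stringsAsFactors = FALSE)

# Assumption: equal-length strings, so Hamming distance counts substitutions only
res <- stringdist_left_join(df, dictionary,
                            by = c("str_var" = "string"),
                            method = "hamming",
                            max_dist = 1) %>%
  group_by(grp_var, gene_id = id) %>%
  summarise(gene_count = sum(num_var), .groups = "drop")
```

Because this is a left join, sequences with no dictionary entry within the distance limit would end up in a group whose gene_id is NA. Unlike the loop with break, a sequence that is within the limit of several dictionary entries would be joined to each of them and counted once per match.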