Efficient way of creating a new variable via fuzzy string matching and grouped summarization in R

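I am trying to use fuzzy string matching to convert strings into specific IDs and then perform a grouped summarization with dplyr. The basic idea is to collapse imperfect gene sequences into a single gene name via a dictionary lookup, and to count how many times that gene was detected. That way, the counts for a sequence such as aaaaaaaaaxaa are matched to gene1 and added together.

I can do what I want with for and if statements, comparing the raw data against the dictionary row by row, but I find this becomes inefficient as I scale up (a raw data file has 15k rows on average, and the dictionary has 200 rows). Please see below the solution I am trying to improve on, and let me know if you can think of a more efficient and elegant way to achieve this.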

df <- data.frame(str_var = rep(c("aaaaaa", "aXaaaa", "bbbbbb", "bbbXbb"), 3),
                 grp_var = rep(c("grp1","grp2"), each=6),
                 num_var = rep(c(1,2), 6))

df
#>    str_var grp_var num_var
#> 1   aaaaaa    grp1       1
#> 2   aXaaaa    grp1       2
#> 3   bbbbbb    grp1       1
#> 4   bbbXbb    grp1       2
#> 5   aaaaaa    grp1       1
#> 6   aXaaaa    grp1       2
#> 7   bbbbbb    grp2       1
#> 8   bbbXbb    grp2       2
#> 9   aaaaaa    grp2       1
#> 10  aXaaaa    grp2       2
#> 11  bbbbbb    grp2       1
#> 12  bbbXbb    grp2       2


dictionary <- data.frame(string = c("aaaaaa","bbbbbb", "cccccc", "dddddd"),
                         id = c("gene1", "gene2", "gene3", "gene4"))

dictionary
#>   string    id
#> 1 aaaaaa gene1
#> 2 bbbbbb gene2
#> 3 cccccc gene3
#> 4 dddddd gene4

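# Start with no assignment so that unmatched rows stay NA.
df$gene_id <- NA_character_

for (i in seq_len(nrow(df))) {
    for (j in seq_len(nrow(dictionary))) {
        # Intended tolerance: at most one substitution, no insertions or deletions.
        match_found <- agrepl(dictionary$string[j], df$str_var[i],
                              max.distance = list(sub=1, ins=0, del=0, all=1-1e-9))
        if (match_found) {
            # Take the first dictionary entry that matches, then stop searching.
            df$gene_id[i] <- dictionary[j, "id"]
            break
        }
    }
}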

suppressPackageStartupMessages(library(dplyr))

new_df <- df %>%
    group_by(grp_var, gene_id) %>%
    summarize(gene_count=sum(num_var))
#> `summarise()` has grouped output by 'grp_var'. You can override using the `.groups` argument.

new_df
#> # A tibble: 4 x 3
#> # Groups:   grp_var [2]
#>   grp_var gene_id gene_count
#>   <chr>   <chr>        <dbl>
#> 1 grp1    gene1            6
#> 2 grp1    gene2            3
#> 3 grp2    gene1            3
#> 4 grp2    gene2            6

Created on 2021-06-08 by the reprex package (v2.0.0)
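
One observation on the loop above: agrepl() is vectorized over its x argument, so the row-by-row comparison can be replaced by one call per dictionary entry, run only on the distinct sequences. The following is a minimal base R sketch of that idea on the same toy data, not a benchmarked solution; the names uniq, hits and first_hit are made up for illustration, and aggregate() stands in for the dplyr step only to keep the example dependency-free.

```r
# Same toy data as in the question.
df <- data.frame(str_var = rep(c("aaaaaa", "aXaaaa", "bbbbbb", "bbbXbb"), 3),
                 grp_var = rep(c("grp1", "grp2"), each = 6),
                 num_var = rep(c(1, 2), 6))
dictionary <- data.frame(string = c("aaaaaa", "bbbbbb", "cccccc", "dddddd"),
                         id = c("gene1", "gene2", "gene3", "gene4"))

# Match only the distinct sequences; repeated rows reuse the result.
uniq <- unique(df$str_var)

# One agrepl() call per dictionary entry, each vectorized over `uniq`.
# Same max.distance settings as the loop in the question.
hits <- vapply(dictionary$string,
               function(pat) agrepl(pat, uniq,
                                    max.distance = list(sub = 1, ins = 0, del = 0,
                                                        all = 1 - 1e-9)),
               logical(length(uniq)))
# Force a matrix (rows = sequences, columns = dictionary entries),
# which also covers the case of a single distinct sequence.
hits <- matrix(hits, nrow = length(uniq))

# Index of the first matching dictionary entry per sequence (NA if none),
# mirroring the `break` in the original loop.
first_hit <- apply(hits, 1, function(r) if (any(r)) which(r)[1] else NA_integer_)

# Map the per-sequence result back onto every row of df.
df$gene_id <- dictionary$id[first_hit][match(df$str_var, uniq)]

# Grouped sum; note that the formula interface drops rows whose gene_id is NA.
new_df <- aggregate(num_var ~ grp_var + gene_id, data = df, FUN = sum)
names(new_df)[3] <- "gene_count"
new_df
```

The number of agrepl() calls is then the number of dictionary rows (200 in the real data) rather than rows times dictionary entries, and df with its new gene_id column can just as well be passed to the same group_by() and summarize() pipeline used above.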

Perhaps fuzzyjoin would make this easier:

library(fuzzyjoin)
stringdist_left_join(df, dictionary, by = c("str_var" = "string")) %>% 
     group_by(grp_var, gene_id = id) %>% 
     summarise(gene_count = sum(num_var), .groups = 'drop')

-output

# A tibble: 4 x 3
  grp_var gene_id gene_count
  <chr>   <chr>        <dbl>
1 grp1    gene1            6
2 grp1    gene2            3
3 grp2    gene1            3
4 grp2    gene2            6
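
A note on tolerance: stringdist_left_join() defaults to method = "osa" with max_dist = 2, which is looser than the single substitution allowed in the question. The sketch below is one way to tighten it, assuming the sequences and the dictionary strings have equal length; Hamming distance counts substitutions only and compares whole strings, so it is not an exact equivalent of agrepl(), which matches the pattern approximately anywhere inside the target. The name strict is made up for illustration.

```r
library(dplyr)
library(fuzzyjoin)

# Same toy data as in the question.
df <- data.frame(str_var = rep(c("aaaaaa", "aXaaaa", "bbbbbb", "bbbXbb"), 3),
                 grp_var = rep(c("grp1", "grp2"), each = 6),
                 num_var = rep(c(1, 2), 6))
dictionary <- data.frame(string = c("aaaaaa", "bbbbbb", "cccccc", "dddddd"),
                         id = c("gene1", "gene2", "gene3", "gene4"))

strict <- stringdist_left_join(df, dictionary,
                               by = c("str_var" = "string"),
                               method = "hamming",  # substitutions only
                               max_dist = 1) %>%    # at most one mismatch
    group_by(grp_var, gene_id = id) %>%
    summarise(gene_count = sum(num_var), .groups = "drop")

strict
```

On the toy data this should give the same four rows as above. If a sequence falls within max_dist of more than one dictionary string, the join returns one row per match and the counts are duplicated, so it may be worth checking for that on real data.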