How to merge rows with the same prefix in a data frame?

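Hi, I'm trying to figure out how to merge the rows of a data frame by matching on a prefix (summing the counts column within each group):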

Example data frame:

set.seed(42)  ## for sake of reproducibility
df <- data.frame(col1=c(sprintf("gene%s", 1:3), sprintf("protein%s", 1:5), sprintf("lipid%s", 1:3)), 
                 counts=runif(11, min=10, max=70))
df
#        col1   counts
# 1     gene1 64.88836
# 2     gene2 66.22452
# 3     gene3 27.16837
# 4  protein1 59.82686
# 5  protein2 48.50473
# 6  protein3 41.14576
# 7  protein4 54.19530
# 8  protein5 18.08000
# 9    lipid1 49.41954
# 10   lipid2 52.30389
# 11   lipid3 37.46451

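So I would like all rows whose col1 starts with "gene" to be merged into a single row, and likewise for the "protein" and "lipid" rows.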

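Desired output:

    col1   counts
    gene 158.2813
   lipid 139.1879
 protein 221.7526

Using dplyr and stringr, strip the digits from col1 and sum within each group:

library(dplyr)
library(stringr)

df %>%
  group_by(col1 = str_remove(col1, "\\d+")) %>%  # drop the numeric suffix
  summarise(counts = sum(counts))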

gsub out the digits, then aggregate using the formula interface.

aggregate(counts ~ gsub('\\d+', '', col1), df, sum)
#   gsub("\\d+", "", col1)   counts
# 1                   gene 158.2813
# 2                  lipid 139.1879
# 3                protein 221.7526
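
To see what the grouping key looks like before aggregating, the digit-stripping step can be run on its own. This is only a minimal sketch: the labels below are hypothetical (in particular "p53_2" does not appear in the question's data), and the anchored `sub()` variant is an alternative I am assuming might be wanted when a prefix itself contains digits, not something the answers above use.

```r
# Hypothetical labels for illustration; "p53_2" is not in the question's data.
x <- c("gene1", "protein12", "p53_2")

# gsub() removes every run of digits, wherever it occurs in the string
all_digits <- gsub("\\d+", "", x)
all_digits
# [1] "gene"    "protein" "p_"

# sub() with a "$" anchor removes only a trailing run of digits
trailing_only <- sub("\\d+$", "", x)
trailing_only
# [1] "gene"    "protein" "p53_"
```

For the data in the question both patterns give the same keys, since the only digits are the trailing suffixes. Note the doubled backslash: in an R string literal, "\\d" is what reaches the regex engine as \d.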

Or, using list notation, which also lets you name the output columns:

with(df, aggregate(list(counts=counts), list(col1=gsub('\\d+', '', col1)), sum))
#      col1   counts
# 1    gene 158.2813
# 2   lipid 139.1879
# 3 protein 221.7526

Side note on generating the strings: you can also use paste0 to append the numeric suffixes.

paste0("gene", 1:3)
# [1] "gene1" "gene2" "gene3"

Data:

df <- structure(list(col1 = c("gene1", "gene2", "gene3", "protein1", 
"protein2", "protein3", "protein4", "protein5", "lipid1", "lipid2", 
"lipid3"), counts = c(64.8883626097813, 66.2245247978717, 27.1683720871806, 
59.8268575640395, 48.5047311335802, 41.145756947808, 54.195298878476, 
18.0799958342686, 49.4195374241099, 52.303887042217, 37.4645065749064
)), class = "data.frame", row.names = c(NA, -11L))