How to merge rows with the same prefix in a data frame?
Hello, I'm trying to figure out how to merge the rows of a data frame by prefix matching (and sum the column):
Example data frame:
set.seed(42) ## for sake of reproducibility
df <- data.frame(col1=c(sprintf("gene%s", 1:3), sprintf("protein%s", 1:5), sprintf("lipid%s", 1:3)),
counts=runif(11, min=10, max=70))
df
# col1 counts
# 1 gene1 64.88836
# 2 gene2 66.22452
# 3 gene3 27.16837
# 4 protein1 59.82686
# 5 protein2 48.50473
# 6 protein3 41.14576
# 7 protein4 54.19530
# 8 protein5 18.08000
# 9 lipid1 49.41954
# 10 lipid2 52.30389
# 11 lipid3 37.46451
So I want all rows starting with "gene" to be merged into one row, and likewise for the protein and lipid rows.
Desired output:
col1 counts
gene 158.2813
lipid 139.1879
protein 221.7526
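library(dplyr)
library(stringr)
# strip the numeric suffix to get the prefix, then sum counts per prefix
df %>%
  group_by(col1 = str_remove(col1, "\\d+")) %>%
  summarise(counts = sum(counts))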
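The grouping key in these approaches is simply the label with its digits removed. As a rough sketch of that step on its own (the `labels` vector below is made up for illustration and is not the question's data), note that inside an R string literal the regex `\d+` has to be written with a doubled backslash:

```r
labels <- c("gene1", "gene2", "protein10", "lipid3")  # illustrative labels

# "\\d+" in R source is the three-character regex \d+
nchar("\\d+")

# Removing the digit run leaves the prefix that serves as the grouping key
key <- gsub("\\d+", "", labels)
key

# Count how many rows would fall into each group
table(key)
```

A single backslash (`"\d+"`) is rejected by the R parser as an unrecognized escape, which is why the doubled form is needed.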
Use gsub to strip the digits, then aggregate, using formula notation:
aggregate(counts ~ gsub('\\d+', '', col1), df, sum)
# gsub("\\d+", "", col1) counts
# 1 gene 158.2813
# 2 lipid 139.1879
# 3 protein 221.7526
or list notation, which also lets you name the grouping column:
with(df, aggregate(list(counts=counts), list(col1=gsub('\\d+', '', col1)), sum))
# col1 counts
# 1 gene 158.2813
# 2 lipid 139.1879
# 3 protein 221.7526
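As one more base R variant, here is a minimal sketch using tapply instead of aggregate. It assumes the number is always a trailing suffix, hence the `$` anchor in the pattern (an assumption, not something stated in the question). The names `prefix`, `sums` and `res` are illustrative, and the counts are the printed values rounded to five decimals.

```r
# Data as printed in the question (counts rounded to five decimals)
df <- data.frame(
  col1 = c(paste0("gene", 1:3), paste0("protein", 1:5), paste0("lipid", 1:3)),
  counts = c(64.88836, 66.22452, 27.16837, 59.82686, 48.50473, 41.14576,
             54.19530, 18.08000, 49.41954, 52.30389, 37.46451)
)

# Assumption: digits only occur as a trailing suffix, so anchor with "$"
prefix <- sub("\\d+$", "", df$col1)

# Sum counts within each prefix; the result is named by prefix
sums <- tapply(df$counts, prefix, sum)

# Rebuild a two-column data frame shaped like the desired output
res <- data.frame(col1 = names(sums), counts = as.vector(sums))
res
```

The groups come back in alphabetical order (gene, lipid, protein), the same order as in the aggregate results.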
Side note on generating the strings: you could also use paste0 to append the numeric suffixes.
paste0("gene", 1:3)
# [1] "gene1" "gene2" "gene3"
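To illustrate the side note, this small sketch checks that the two ways of building the labels agree; the variable names `a` and `b` are just for illustration. It also shows one case where sprintf might still be preferred, namely zero-padded suffixes.

```r
# Both calls build the same character vector of labels
a <- sprintf("gene%s", 1:3)
b <- paste0("gene", 1:3)
identical(a, b)

# sprintf can additionally zero-pad the number via a format specifier
padded <- sprintf("gene%02d", 1:3)
padded
```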
Data:
df <- structure(list(col1 = c("gene1", "gene2", "gene3", "protein1",
"protein2", "protein3", "protein4", "protein5", "lipid1", "lipid2",
"lipid3"), counts = c(64.8883626097813, 66.2245247978717, 27.1683720871806,
59.8268575640395, 48.5047311335802, 41.145756947808, 54.195298878476,
18.0799958342686, 49.4195374241099, 52.303887042217, 37.4645065749064
)), class = "data.frame", row.names = c(NA, -11L))