Join data frames on a specific column
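I have several data frames in the format shown below. I want to join/merge them on species and extract kmers from every data frame, so that the output has one species column and several kmers columns, one per file. Each kmers column should be named after the file it came from.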
df1
reads taxReads kmers species
232 2323 23234 Bacteria
555 12 4545 Virus
df2
reads taxReads kmers species
12 23 56 Bacteria
932 1213 12 Virus
Desired output:
species df1 df2
Bacteria 23234 56
Virus 4545 12
I tried writing a script with join_all, but it does not select the correct column (kmers):
file_list = list.files(pattern="tsv$")
datalist = lapply(file_list, function(x){
dat = read.csv(file=x, header=T, sep = "\t")
names(dat)[2] = x
return(dat)
})
joined <- join_all(dfs = datalist,by = "species",type ="full" )
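I'm assuming you have read the files into a list of frames whose elements are named after each file's basename (extension removed). Calling that list of frames dfs, we have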
dfs <- list(df1 = structure(list(reads = c(232L, 555L), taxReads = c(2323L, 12L), kmers = c(23234L, 4545L), species = c("Bacteria", "Virus")), class = "data.frame", row.names = c(NA, -2L)), df2 = structure(list(reads = c(12L, 932L), taxReads = c(23L, 1213L), kmers = c(56L,12L), species = c("Bacteria", "Virus")), class = "data.frame", row.names = c(NA, -2L)))
dfs
# $df1
# reads taxReads kmers species
# 1 232 2323 23234 Bacteria
# 2 555 12 4545 Virus
# $df2
# reads taxReads kmers species
# 1 12 23 56 Bacteria
# 2 932 1213 12 Virus
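In case it helps, here is a minimal sketch of one way to build such a named list from files on disk. The temporary directory and the file names df1.tsv and df2.tsv are assumptions made only so the example runs on its own; with real data you would call list.files on the directory that holds your files.

```r
# Hypothetical setup: write two small tab-separated files into a temp directory
# so the example is self-contained; real code would use the existing files.
dir <- tempfile("kmers_")
dir.create(dir)
writeLines(c("reads\ttaxReads\tkmers\tspecies",
             "232\t2323\t23234\tBacteria",
             "555\t12\t4545\tVirus"), file.path(dir, "df1.tsv"))
writeLines(c("reads\ttaxReads\tkmers\tspecies",
             "12\t23\t56\tBacteria",
             "932\t1213\t12\tVirus"), file.path(dir, "df2.tsv"))

file_list <- list.files(dir, pattern = "\\.tsv$", full.names = TRUE)

# Read each file (read.delim defaults to header = TRUE, sep = "\t"),
# then name the list elements by file name without directory or extension.
dfs <- lapply(file_list, read.delim, stringsAsFactors = FALSE)
names(dfs) <- tools::file_path_sans_ext(basename(file_list))

names(dfs)
```

Naming the list up front means the later steps can take the new column names straight from names(dfs), instead of relying on column positions.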
From here, two steps:
Rename the kmers column to the file name (without extension), and drop the columns that are not needed:
dfs <- Map(function(x, nm) { names(x)[names(x) == "kmers"] <- nm; x[, c("species", nm)]; }, dfs, names(dfs))
dfs
# $df1
# species df1
# 1 Bacteria 23234
# 2 Virus 4545
# $df2
# species df2
# 1 Bacteria 56
# 2 Virus 12
Reduce the list with merge:
Reduce(function(d1, d2) merge(d1, d2, by = "species", all = TRUE), dfs)
# species df1 df2
# 1 Bacteria 23234 56
# 2 Virus 4545 12
Here this could be written as just Reduce(merge, dfs), but I broke it out into an anonymous function of two arguments so that you can control some of merge's options.
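To show why those options can matter, here is a small sketch in which the frames do not share all species. The data are made up for illustration (the Archaea row and its value are not from the question), and the frames are assumed to be already reduced to species plus one renamed kmers column.

```r
# Hypothetical frames: df2 has no Virus row and has an extra Archaea row.
dfs <- list(
  df1 = data.frame(species = c("Bacteria", "Virus"), df1 = c(23234L, 4545L)),
  df2 = data.frame(species = c("Bacteria", "Archaea"), df2 = c(56L, 7L))
)

# Default merge is an inner join: only species present in every frame survive.
inner <- Reduce(merge, dfs)

# all = TRUE gives a full outer join: every species is kept, gaps become NA.
full <- Reduce(function(d1, d2) merge(d1, d2, by = "species", all = TRUE), dfs)

nrow(inner)
nrow(full)
```

With these inputs the inner join keeps only the Bacteria row, while the full join keeps all three species and fills the missing counts with NA. Which behaviour you want depends on whether a species absent from one file should still appear in the output.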