Join data frames on a specific column

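I have several data frames in the format shown below. I want to join/merge them by species and pull the kmers column out of each one, so that the output has one species column and several kmers columns, one per file. Each kmers column should be named after the file it came from.

df1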

reads taxReads kmers species
232 2323 23234 Bacteria
555 12 4545 Virus

df2

reads taxReads kmers species
12 23 56 Bacteria
932 1213 12 Virus

Desired output

species df1 df2
Bacteria 23234 56
Virus 4545 12

I tried writing a script with join_all, but it does not pick up the right column (kmers):

file_list = list.files(pattern="tsv$")    

datalist = lapply(file_list, function(x){
  dat = read.csv(file=x, header=T, sep = "\t")
  names(dat)[2] = x
  return(dat)
})
joined <- join_all(dfs = datalist,by = "species",type ="full" )  

I'm assuming you have read the files into a list of frames, named by each file's basename (extension removed). Calling that list of frames dfs, we have

dfs <- list(df1 = structure(list(reads = c(232L, 555L), taxReads = c(2323L, 12L), kmers = c(23234L, 4545L), species = c("Bacteria", "Virus")), class = "data.frame", row.names = c(NA, -2L)), df2 = structure(list(reads = c(12L, 932L), taxReads = c(23L, 1213L), kmers = c(56L,12L), species = c("Bacteria", "Virus")), class = "data.frame", row.names = c(NA, -2L)))

dfs
# $df1
#   reads taxReads kmers  species
# 1   232     2323 23234 Bacteria
# 2   555       12  4545    Virus
# $df2
#   reads taxReads kmers  species
# 1    12       23    56 Bacteria
# 2   932     1213    12    Virus
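
The named list assumed above could be built straight from the files. This is only a minimal sketch, assuming tab-separated files ending in .tsv that all sit in one directory; read_tsv_list is a hypothetical helper name, not something from the question.

```r
# Sketch: read every .tsv file in a directory into a named list of data frames.
# `read_tsv_list` is a hypothetical helper, introduced only for illustration.
read_tsv_list <- function(dir = ".") {
  files <- list.files(dir, pattern = "\\.tsv$", full.names = TRUE)
  # read.delim defaults to header = TRUE and sep = "\t"
  dfs <- lapply(files, read.delim, stringsAsFactors = FALSE)
  # Name each frame after its file, with the directory and extension removed
  names(dfs) <- tools::file_path_sans_ext(basename(files))
  dfs
}
```

Calling dfs <- read_tsv_list() in the directory holding the files should then give a list shaped like the one printed above.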

From here, two steps:

  1. Rename the kmers column to the file name (without extension) and drop the columns you don't need,

    dfs <- Map(function(x, nm) { names(x)[names(x) == "kmers"] <- nm; x[, c("species", nm)]; }, dfs, names(dfs))
    dfs
    # $df1
    #    species   df1
    # 1 Bacteria 23234
    # 2    Virus  4545
    # $df2
    #    species df2
    # 1 Bacteria  56
    # 2    Virus  12
    
  2. Reduce the list with merge

    Reduce(function(d1, d2) merge(d1, d2, by = "species", all = TRUE), dfs)
    #    species   df1 df2
    # 1 Bacteria 23234  56
    # 2    Virus  4545  12
    

    In this case it could be written as just Reduce(merge, dfs), but I broke it out into a two-argument anonymous function so that you can control some of merge's options.
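
To illustrate why controlling merge's options can matter, here is a small sketch with made-up frames (not data from the question) in which one species is missing from each file. With the default arguments merge keeps only the species present in every frame, while all = TRUE keeps every species and fills the gaps with NA.

```r
# Hypothetical frames: "Fungi" is only in the first, "Virus" only in the second.
a <- data.frame(species = c("Bacteria", "Fungi"), df1 = c(23234L, 7L))
b <- data.frame(species = c("Bacteria", "Virus"), df2 = c(56L, 12L))

# Default merge: inner join on the shared column (species)
inner <- Reduce(merge, list(a, b))

# Full outer join: every species is kept, missing counts become NA
full <- Reduce(function(d1, d2) merge(d1, d2, by = "species", all = TRUE),
               list(a, b))
```

Here inner keeps only the Bacteria row, whereas full has one row per species seen in any file.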