R script to batch-process all .tsv files in a directory, adding a new column built from other columns

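I want to combine several steps into one R script that does the following:

  1. Load .tsv files one after another (there are hundreds in one directory)
  2. Fuse 3 specific columns in each file to create a new column "Fusion"
  3. Write the result back to the original .tsv file (so that I don't end up with hundreds of new files)

The steps below work, but I'm afraid they are clumsy (I really can't code), and they aren't batched: each file has to be put in one at a time.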

test <- read.table(
   "1.tsv",
   sep="\t", header=TRUE)

test$Fusion <- paste0(test$amino_acid,test$v_gene,test$j_gene)

write.table(test, file = "1.tsv", append = FALSE, quote = TRUE, sep = "\t",
                 eol = "\n", na = "NA", dec = ".", row.names = TRUE,
                 col.names = TRUE, qmethod = c("escape", "double"),
                 fileEncoding = "")

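As you can see, the files have to be fed in manually one at a time, and the data frame "test" also seems redundant (?).

It would be great if someone could put this into a single script that just uses R's working directory and goes through the files one by one, adds a new "Fusion" column, writes the new .tsv file and moves on.

Thank you very much for your help!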

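Here is how I would loop your code over every file in the pwd, keeping your approach. Make sure you run this script from the directory that contains the target .tsv files.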

#!/usr/bin/Rscript

print(getwd()) ## print the pwd to standard output to confirm that you are in the
               ## right directory
files<-list.files(".",pattern="\\.tsv$") ## List all files in the pwd that end in .tsv
                                         ## (pattern is a regular expression, not a glob)
cols2fuse<-c("amino_acid","v_gene","j_gene") ## Parameterized the columns to fuse
prefix<-"fused-" ## Include this so that you don't overwrite your old files while testing;
                 ## you can always delete them later

fuseColumns<-function(file,cols2fuse){
    test<-read.table(file,sep="\t",header=TRUE)
    test$Fusion <- do.call(paste0,unname(as.list(test[cols2fuse]))) ## fuse the columns named in cols2fuse
    write.table(test,
                file =paste0(prefix,file), ## only works if performed in the pwd,
                                        ## otherwise you may end up with
                                        ## something like:
                                        ## "fused-/home/username/file/1.tsv"
                sep = "\t",
                quote = TRUE, ## this will surround each output
                              ## value in quotes;
                              ## this may not be desirable
                row.names = TRUE, ## Do you really want the row names
                                  ## included?
                col.names = TRUE)
    file ## return the file that has been edited (this will show up in stdout)
}

lapply(files,fuseColumns,cols2fuse) ## Apply fuseColumns to all .tsv files, fusing
                                    ## the columns whose names
                                    ## match those in cols2fuse
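
Since the question asks for the results to go back into the original files, here is a minimal sketch of an in-place variant. The helper name fuseInPlace and the temporary demo directory are my own assumptions for illustration and are not part of the script above. It overwrites each file directly, so try it on a copy of your data first.

```r
## Hypothetical helper (not part of the script above): fuse the chosen
## columns and write the result back to the same file.
fuseInPlace <- function(file, cols2fuse, sep = "\t") {
    dat <- read.table(file, sep = sep, header = TRUE, stringsAsFactors = FALSE)
    missingCols <- setdiff(cols2fuse, colnames(dat))
    if (length(missingCols) > 0) {
        ## leave files without the expected columns untouched
        warning("Skipping ", file, ": missing ", paste(missingCols, collapse = ", "))
        return(invisible(NULL))
    }
    ## paste the selected columns together row by row
    dat$Fusion <- do.call(paste0, unname(as.list(dat[cols2fuse])))
    write.table(dat, file = file, sep = sep, quote = FALSE,
                row.names = FALSE, col.names = TRUE)
    invisible(file)
}

## Demo on a throwaway directory so that no real data is touched
demoDir <- tempfile("tsv-demo-")
dir.create(demoDir)
demoFile <- file.path(demoDir, "1.tsv")
writeLines(c("amino_acid\tv_gene\tj_gene",
             "amino1\tENS0001001\tENS0002001",
             "amino2\tENS0003001\tENS0004001"), demoFile)

files <- list.files(demoDir, pattern = "\\.tsv$", full.names = TRUE)
invisible(lapply(files, fuseInPlace, cols2fuse = c("amino_acid", "v_gene", "j_gene")))

result <- read.table(demoFile, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
print(result$Fusion)
```

Because full.names = TRUE returns complete paths and no prefix is prepended, this version does not depend on being run from the pwd. Running it a second time simply recomputes the Fusion column from the same three columns.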

Example input

amino_acid  v_gene  j_gene
amino1  ENS0001001  ENS0002001
amino2  ENS0003001  ENS0004001
amino3  ENS0005001  ENS0006001
amino4  ENS0007001  ENS0008001

is transformed into

"amino_acid"    "v_gene"    "j_gene"    "Fusion"
"1" "amino1"    "ENS0001001"    "ENS0002001"    "amino1ENS0001001ENS0002001"
"2" "amino2"    "ENS0003001"    "ENS0004001"    "amino2ENS0003001ENS0004001"
"3" "amino3"    "ENS0005001"    "ENS0006001"    "amino3ENS0005001ENS0006001"
"4" "amino4"    "ENS0007001"    "ENS0008001"    "amino4ENS0007001ENS0008001"

To remove the quotes around each element, set quote = FALSE; to remove the numbers at the start of each row, set row.names = FALSE as well.

write.table(test,
            file =paste0(prefix,file),                  
            sep = "\t",
            quote = FALSE,
            row.names = FALSE,                       
            col.names = TRUE)

The output now looks like

amino_acid  v_gene  j_gene  Fusion
amino1  ENS0001001  ENS0002001  amino1ENS0001001ENS0002001
amino2  ENS0003001  ENS0004001  amino2ENS0003001ENS0004001
amino3  ENS0005001  ENS0006001  amino3ENS0005001ENS0006001
amino4  ENS0007001  ENS0008001  amino4ENS0007001ENS0008001

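I'm not sure whether by "redundant" you meant that you want to remove the three columns and show only the fused one?

You can use something like this to identify the redundant columns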

redundantCols<-unlist(sapply(colnames(test),`%in%`,cols2fuse))
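
If dropping those columns is indeed what you want, a small sketch of how the logical mask could be used follows. The toy data frame and the name fusedOnly are assumptions made up for this example, and colnames(test) %in% cols2fuse is used as a shorter way to get the same TRUE/FALSE values as the line above.

```r
cols2fuse <- c("amino_acid", "v_gene", "j_gene")

## Toy data standing in for one of the .tsv files
test <- data.frame(
    amino_acid = c("amino1", "amino2"),
    v_gene = c("ENS0001001", "ENS0003001"),
    j_gene = c("ENS0002001", "ENS0004001"),
    stringsAsFactors = FALSE
)
test$Fusion <- paste0(test$amino_acid, test$v_gene, test$j_gene)

## TRUE for every column that went into the fusion
redundantCols <- colnames(test) %in% cols2fuse

## Keep only the columns that are not redundant;
## drop = FALSE keeps the result a data frame even with one column left
fusedOnly <- test[, !redundantCols, drop = FALSE]
print(colnames(fusedOnly))
```

Passing fusedOnly to write.table instead of test would then write only the Fusion column (plus any other columns that were not fused).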