R script to batch-process all .tsv files in a directory, adding a new column built from other columns
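I would like to combine several steps into a single R script that does the following:
- load .tsv files one after another (there are hundreds in one directory)
- fuse 3 specific columns in each file to create a new column "Fusion"
- write the result back to the old .tsv file (so that I don't end up with hundreds of new files)
The steps below work, but I'm afraid they are clumsy (I really can't code), and they are not batched: the files have to be entered one at a time.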
test <- read.table(
"1.tsv",
sep="\t", header=TRUE)
test$Fusion <- paste0(test$amino_acid,test$v_gene,test$j_gene)
write.table(test, file = "1.tsv", append = FALSE, quote = TRUE, sep = "\t",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")
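As you can see, each file has to be entered manually, one at a time, and the data frame "test" also seems redundant (?).
It would be great if someone could put this into a single script that just uses R's working directory, goes through the files one by one, adds a new "Fusion" column, writes the new .tsv file, and moves on.
Thank you very much for your help!
Below is how I would loop your code over every file in the working directory, using your approach. Make sure you run this script from the directory that contains the target .tsv files.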
#!/usr/bin/Rscript
print(getwd())  ## print the working directory to standard output to confirm
                ## that you are in the right directory
files <- list.files(".", pattern = "\\.tsv$")  ## list all files in the working directory that
                                               ## end in .tsv (pattern is a regular expression,
                                               ## not a shell glob, so "*.tsv" is not reliable)
cols2fuse <- c("amino_acid", "v_gene", "j_gene")  ## parameterized: the columns to fuse
prefix <- "fused-"  ## include this so that you don't overwrite your old files while testing;
                    ## you can always delete them later
fuseColumns <- function(file, cols2fuse) {
  test <- read.table(file, sep = "\t", header = TRUE)
  ## paste the columns named in cols2fuse together, row by row
  test$Fusion <- do.call(paste0, test[cols2fuse])
  write.table(test,
              file = paste0(prefix, file),  ## only works if performed in the working directory,
                                            ## otherwise you may end up with
                                            ## something like:
                                            ## "fused-/home/username/file/1.tsv"
              sep = "\t",
              quote = TRUE,      ## this will surround each output
                                 ## value in quotes;
                                 ## this may not be desirable
              row.names = TRUE,  ## do you really want the row names
                                 ## included?
              col.names = TRUE)
  file  ## return the name of the file that has been processed (this will show up in stdout)
}
lapply(files, fuseColumns, cols2fuse)  ## apply fuseColumns to all .tsv files, fusing
                                       ## the columns whose names
                                       ## match those in cols2fuse
Example input
amino_acid v_gene j_gene
amino1 ENS0001001 ENS0002001
amino2 ENS0003001 ENS0004001
amino3 ENS0005001 ENS0006001
amino4 ENS0007001 ENS0008001
is transformed into
"amino_acid" "v_gene" "j_gene" "Fusion"
"1" "amino1" "ENS0001001" "ENS0002001" "amino1ENS0001001ENS0002001"
"2" "amino2" "ENS0003001" "ENS0004001" "amino2ENS0003001ENS0004001"
"3" "amino3" "ENS0005001" "ENS0006001" "amino3ENS0005001ENS0006001"
"4" "amino4" "ENS0007001" "ENS0008001" "amino4ENS0007001ENS0008001"
To remove the quotes around each element, set quote to FALSE, and to remove the numbers at the start of each row, set row.names to FALSE as well.
write.table(test,
file =paste0(prefix,file),
sep = "\t",
quote = FALSE,
row.names = FALSE,
col.names = TRUE)
The output now looks like
amino_acid v_gene j_gene Fusion
amino1 ENS0001001 ENS0002001 amino1ENS0001001ENS0002001
amino2 ENS0003001 ENS0004001 amino2ENS0003001ENS0004001
amino3 ENS0005001 ENS0006001 amino3ENS0005001ENS0006001
amino4 ENS0007001 ENS0008001 amino4ENS0007001ENS0008001
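Putting the pieces above together, here is a minimal sketch rather than a definitive implementation. It assumes every .tsv file contains all of the columns to fuse. The helper name `fuse_file` and its `overwrite` argument are hypothetical and not part of the original script. The sketch matches file names with an anchored regular expression, builds the output path with `file.path()` so that it also works outside the working directory, and only overwrites the input when asked to. The demo runs in a temporary directory on made-up data.

```r
## Hypothetical helper (not from the original script): fuse columns in one file.
fuse_file <- function(path, cols, prefix = "fused-", overwrite = FALSE) {
  dat <- read.table(path, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
  missing_cols <- setdiff(cols, colnames(dat))
  if (length(missing_cols) > 0) {
    stop("Missing columns in ", path, ": ", paste(missing_cols, collapse = ", "))
  }
  ## Paste the selected columns together row by row.
  dat$Fusion <- do.call(paste0, unname(as.list(dat[cols])))
  ## Either overwrite the input or write next to it with a prefix.
  out <- if (overwrite) path else file.path(dirname(path), paste0(prefix, basename(path)))
  write.table(dat, file = out, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = TRUE)
  out
}

## Demo in a fresh temporary directory so that no real data is touched.
demo_dir <- file.path(tempdir(), "fusion-demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("amino_acid\tv_gene\tj_gene",
             "amino1\tENS0001001\tENS0002001",
             "amino2\tENS0003001\tENS0004001"),
           file.path(demo_dir, "1.tsv"))

## "\\.tsv$" is a regular expression matching names that end in ".tsv".
files <- list.files(demo_dir, pattern = "\\.tsv$", full.names = TRUE)
written <- vapply(files, fuse_file, character(1),
                  cols = c("amino_acid", "v_gene", "j_gene"))
result <- read.table(written[[1]], sep = "\t", header = TRUE,
                     stringsAsFactors = FALSE)
print(result$Fusion)
```

To overwrite the originals, as the question asks, pass `overwrite = TRUE` once the prefixed output looks right. Note that a second run over the same directory would also pick up the `fused-` files, so delete them or write them elsewhere first.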
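I'm not sure whether, by redundant, you mean that you would like to remove the three source columns and show only the fused column?
You could use something like this to identify the redundant columns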
redundantCols<-unlist(sapply(colnames(test),`%in%`,cols2fuse))
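If the goal is indeed to drop the three source columns, a logical index like the one above can be negated to keep everything else. This is a minimal sketch with a small made-up data frame standing in for one file's contents. The data values and the names `keep` and `fusedOnly` are illustrative only.

```r
## Illustrative data frame standing in for one .tsv file that has been read in.
test <- data.frame(amino_acid = c("amino1", "amino2"),
                   v_gene = c("ENS0001001", "ENS0003001"),
                   j_gene = c("ENS0002001", "ENS0004001"),
                   stringsAsFactors = FALSE)
cols2fuse <- c("amino_acid", "v_gene", "j_gene")
test$Fusion <- do.call(paste0, unname(as.list(test[cols2fuse])))

## TRUE for every column that was NOT used to build Fusion.
keep <- !(colnames(test) %in% cols2fuse)
## drop = FALSE keeps the result a data frame even if only one column remains.
fusedOnly <- test[, keep, drop = FALSE]
print(colnames(fusedOnly))
```

Writing `fusedOnly` instead of `test` with `write.table` would then produce files that contain only the fused column plus any columns that were not part of the fusion.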