How can I get the common rows that exist in at least two files?
I have seven test files. They look like the following:
File 1
chr start end strand
chr1 10525 10525 +
chr1 10542 10542 +
chr1 10571 10571 +
chr1 10577 10577 +
chr2 10589 10589 +
chr2 565262 565262 +
chr2 565397 565397 +
chr3 567239 567239 +
chr3 567312 567312 +
chr4 567348 567348 +
How can I get the rows that are common to at least two files, in the following format?
chr start end strand File1 File2 File3 File4 File5 File6 File7
chr1 10525 10525 + 0 1 0 0 0 1 1
chr1 10542 10542 + 1 1 1 1 1 0 0
chr1 10571 10571 + 0 1 0 1 1 0 0
chr3 10577 10577 + 1 1 0 0 0 1 0
chr3 10589 10589 + 0 0 1 0 1 0 1
chr4 565262 565262 + 1 0 0 1 1 1 1
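A "1" means the row is present in the given file, and a "0" means the row is not present in that file. I don't want to show rows that are not shared by at least two files.
Using awk: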
awk '
FNR==1{                                       # Header line:
    fn[++i]=FILENAME;                         # record filenames
    fn[0]=$0;                                 # & file header
}
(FNR>1){                                      # For lines other than header lines
    list[$0]++;                               # Record line
    file_list[$0 FILENAME]++;                 # Record which file has that line
}
END{
    for(t=0;t<=i;t++) printf "%s\t", fn[t];   # Print header & file names
    print "";                                 # Quick hack for printing newline.
    for(t in list){                           # For every line that occurred in any of the files
        if (list[t]>=2){                      # If count is >= 2
            printf "%s\t", t;                 # Print line
            for(j=1;j<=i;j++) {
                printf "%d\t", file_list[t fn[j]];   # Print per file occurrence count.
            }
            print ""                          # Print newline.
        }
    }
}' File{1..7}
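The script above counts every occurrence of a line, so a row repeated inside a single file would also reach the threshold of 2, and the columns show counts rather than 0/1 flags. The following is only a sketch of one possible variant, assuming the goal is to count distinct files and print presence flags. The function name `common_rows`, the temporary directory, and the three small demo files are hypothetical and stand in for the seven files from the question.

```shell
#!/bin/sh
# Hypothetical demo data: three small tab-separated files in a temp directory.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'chr\tstart\tend\tstrand\nchr1\t10525\t10525\t+\nchr1\t10542\t10542\t+\n' > "$dir/File1"
printf 'chr\tstart\tend\tstrand\nchr1\t10542\t10542\t+\nchr2\t10589\t10589\t+\n' > "$dir/File2"
# File3 repeats one row on purpose to show it is still counted once per file.
printf 'chr\tstart\tend\tstrand\nchr1\t10542\t10542\t+\nchr2\t10589\t10589\t+\nchr2\t10589\t10589\t+\n' > "$dir/File3"

# common_rows FILE...: print rows found in at least two of the given files,
# followed by one 0/1 column per file.
common_rows() {
  awk '
  FNR == 1 {                        # first line of each file is its header
      fn[++n] = FILENAME            # remember file order for the columns
      header = $0
      next
  }
  !(($0, FILENAME) in seen) {       # first time this row appears in this file
      seen[$0, FILENAME] = 1
      nfiles[$0]++                  # number of distinct files holding the row
  }
  END {
      printf "%s", header
      for (j = 1; j <= n; j++) printf "\t%s", fn[j]
      print ""
      for (row in nfiles) {         # iteration order is unspecified in awk
          if (nfiles[row] < 2) continue
          printf "%s", row
          for (j = 1; j <= n; j++) {
              present = ((row, fn[j]) in seen) ? 1 : 0
              printf "\t%d", present
          }
          print ""
      }
  }' "$@"
}

# Run from inside the directory so the column names are plain file names.
(cd "$dir" && common_rows File1 File2 File3)
```

With this demo data, the row `chr1 10542 10542 +` is reported with flags `1 1 1` and `chr2 10589 10589 +` with `0 1 1`, while `chr1 10525 10525 +` is dropped because it occurs only in File1. Row order is not guaranteed, so pipe the data rows through `sort` if a stable order matters.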