Arranging data and making a dictionary
I have a tab-delimited file that looks like this:
chr14 106559873 106560782 MA0004.1_Arnt
chr14 106559873 106560782 MA0093.1_USF1
chr14 106559873 106560782 MA0147.1_Myc
chr14 106559873 106560782 RUNX3_DBD_WAACCRCAAWAACCRCAN
chr10 17037867 17038971 MA0080.2_SPI1
chr10 17037867 17038971 MA0152.1_NFATC2
chr17 8610947 8611433 MA0080.2_SPI1
chr17 8610947 8611433 MA0098.1_ETS1
I would like to arrange it like this:
Regions MA0004.1_Arnt MA0093.1_USF1 MA0147.1_Myc RUNX3_DBD_WAACCRCAAWAACCRCAN MA0080.2_SPI1 MA0152.1_NFATC2 MA0098.1_ETS1
chr14;106559873;106560782 1 1 1 1 0 0 0
chr10;17037867;17038971 0 0 0 0 1 1 0
chr17;8610947;8611433 0 0 0 0 1 0 1
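The example output shows only the first four lines, but this needs to be applied to the whole file. A 1 means the string is present for that region and a 0 means it is absent.
This is an intermediate step in code I am writing, and it is crucial for my analysis. I can't figure out how to do this in awk.
Thanks.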
This awk script should get you most of the way there:
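BEGIN {
    # Header row; the motif names are hard-coded in the desired column order
    print "Regions MA0004.1_Arnt MA0093.1_USF1 MA0147.1_Myc RUNX3_DBD_WAACCRCAAWAACCRCAN MA0080.2_SPI1 MA0152.1_NFATC2 MA0098.1_ETS1"
    # Start with every motif flag set to 0
    a["MA0004.1_Arnt"] = a["MA0093.1_USF1"] = \
    a["MA0147.1_Myc"] = a["RUNX3_DBD_WAACCRCAAWAACCRCAN"] = \
    a["MA0080.2_SPI1"] = a["MA0152.1_NFATC2"] = a["MA0098.1_ETS1"] = 0
}
# Print the current region followed by its seven flags
function print_fields() {
    print p";"s";"e, a["MA0004.1_Arnt"], a["MA0093.1_USF1"],
          a["MA0147.1_Myc"], a["RUNX3_DBD_WAACCRCAAWAACCRCAN"],
          a["MA0080.2_SPI1"], a["MA0152.1_NFATC2"], a["MA0098.1_ETS1"]
}
# Column 1 changed (only the chromosome is compared): print the
# finished region and reset all flags
NR>1 && $1!=p {
    print_fields()
    for (i in a) a[i] = 0
}
# Remember the current region and mark its motif as present
{ p=$1; s=$2; e=$3; a[$4]=1 }
# Print the last region
END { print_fields() }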
Testing:
$ awk -f script.awk file
Regions MA0004.1_Arnt MA0093.1_USF1 MA0147.1_Myc RUNX3_DBD_WAACCRCAAWAACCRCAN MA0080.2_SPI1 MA0152.1_NFATC2 MA0098.1_ETS1
chr14;106559873;106560782 1 1 1 1 0 0 0
chr10;17037867;17038971 0 0 0 0 1 1 0
chr17;8610947;8611433 0 0 0 0 1 0 1
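The script above hard-codes the seven motif names. If the full file contains other motifs, a more general variant might collect the column names from the data itself. The following is only a sketch under a few assumptions: the input has exactly four whitespace-separated columns, the whole file fits in memory, and columns and rows should appear in order of first occurrence. The function name `make_matrix` and the array names are made up for illustration. Unlike the script above, it keys each row on all three coordinate columns, so lines for the same region do not need to be adjacent.

```shell
# Hypothetical helper: print a presence/absence matrix for a 4-column file.
make_matrix() {
    awk '
    {
        region = $1 ";" $2 ";" $3
        # Record each region and motif once, in order of first appearance
        if (!(region in seen_region)) { seen_region[region] = 1; regions[++n_regions] = region }
        if (!($4 in seen_motif))      { seen_motif[$4] = 1;      motifs[++n_motifs] = $4 }
        # Mark this motif as present for this region
        present[region, $4] = 1
    }
    END {
        # Header: "Regions" followed by every motif name seen in the file
        line = "Regions"
        for (j = 1; j <= n_motifs; j++) line = line " " motifs[j]
        print line
        # One row per region: 1 if the motif was seen for it, otherwise 0
        for (i = 1; i <= n_regions; i++) {
            line = regions[i]
            for (j = 1; j <= n_motifs; j++)
                line = line " " (((regions[i], motifs[j]) in present) ? 1 : 0)
            print line
        }
    }' "$1"
}
```

Calling `make_matrix file` on the sample input should give the same header and rows as the test run above, since the motifs and regions first appear in that order.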