如何将基因组 SNP 转换为二进制格式?
How to convert genomic SNPs to a binary format?
我正在努力弄清楚如何转换如下所示的 (SNP) 文件:
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,C,C,C,C,C,C,C,C,C,C,C,C,G,G
79115,C,C,C,C,C,C,C,A,C,C,T,C,C,C
79116,A,A,A,A,A,A,A,A,A,A,A,A,A,A
79124,C,C,T,T,C,C,C,C,C,C,C,C,C,C
79128,G,G,G,G,A,G,G,G,G,C,G,G,G,G
像这样的二进制格式:
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,0,0,0,0,0,0,0,0,0,0,0,0,1,1
79115,0,0,0,0,0,0,0,1,0,0,1,0,0,0
79116,0,0,0,0,0,0,0,0,0,0,0,0,0,0
79124,0,0,1,1,0,0,0,0,0,0,0,0,0,0
79128,0,0,0,0,1,0,0,0,0,1,0,0,0,0
有人知道 R 中的包或命令行工具、awk 代码可用于此吗?
以下解决方案的一个问题是都这样做:
1097023,A,A,G,G,G,G,G,G,G,G,G,G,A,A
1097027,C,C,C,C,C,C,C,C,C,C,C,C,C,C
4363243,C,A,A,A,A,A,A,A,A,A,A,A,A,A
4363270,T,T,T,T,T,T,T,T,T,T,T,T,T,T
4363275,A,G,G,G,G,G,G,G,G,G,G,G,G,G
1097023,0,0,1,1,1,1,1,1,1,1,1,1,0,0
1097027,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4363243,0,1,1,1,1,1,1,1,1,1,1,1,1,1
4363270,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4363275,0,1,1,1,1,1,1,1,1,1,1,1,1,1
虽然应该是:
1097023,1,1,0,0,0,0,0,0,0,0,0,0,1,1
1097027,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4363243,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4363270,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4363275,1,0,0,0,0,0,0,0,0,0,0,0,0,0
所以我认为解决方案应该包括如果大多数样本 'i.e. > 7 samples' 相同,则变为 0,不符合多数的变为 1。
这是一个简单的awk
(标准Linuxawk/gawk
)脚本:
script.awk
BEGIN {FS = ","} # set field seperator to ","
NR>1{ # every line except the first line
zeroIndc=; # identify the zero indicator and save the variable
for (i = 2; i <= NF; i++) { # for each DNA letter
if ($i == zeroIndc) { # if DNA letter match zero indicator
$i = 0; # set DNA letter to 0
} else { # if DNA letter not match zero indicator
$i = 1; # set DNA letter to 1
}
}
print; # print new line at end
}
运行 脚本 script.awk
awk -f scirpt.awk input.txt
运行 单行 awk 脚本
awk 'BEGIN{FS=","}NR>1{z=;for(i=2;i<=NF;i++)$i=($i==z)?0:1;print;}' input.txt
使用 base R,您可以将每个 sample
列与第一个样本列进行比较,并在值不匹配的地方将 return 转为 1。
cols <- grep('sample', names(df))
df[cols] <- +(df$sample1 != df[cols])
df
# pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9
#1 79107 0 0 0 0 0 0 0 0 0
#2 79115 0 0 0 0 0 0 0 1 0
#3 79116 0 0 0 0 0 0 0 0 0
#4 79124 0 0 1 1 0 0 0 0 0
#5 79128 0 0 0 0 1 0 0 0 0
# sample10 sample11 sample12 sample13 sample14
#1 0 0 0 1 1
#2 0 1 0 0 0
#3 0 0 0 0 0
#4 0 0 0 0 0
#5 1 0 0 0 0
虽然上面的方法在大型数据集上会更有效,但这里有一个 dplyr
库的替代方法。
library(dplyr)
df <- df %>% mutate(across(contains('sample'), ~+(sample1 != .)))
数据
df <- structure(list(pos = c(79107L, 79115L, 79116L, 79124L, 79128L
), sample1 = c("C", "C", "A", "C", "G"), sample2 = c("C", "C",
"A", "C", "G"), sample3 = c("C", "C", "A", "T", "G"), sample4 = c("C",
"C", "A", "T", "G"), sample5 = c("C", "C", "A", "C", "A"), sample6 = c("C",
"C", "A", "C", "G"), sample7 = c("C", "C", "A", "C", "G"), sample8 = c("C",
"A", "A", "C", "G"), sample9 = c("C", "C", "A", "C", "G"), sample10 = c("C",
"C", "A", "C", "C"), sample11 = c("C", "T", "A", "C", "G"), sample12 = c("C",
"C", "A", "C", "G"), sample13 = c("G", "C", "A", "C", "G"), sample14 = c("G",
"C", "A", "C", "G")), class = "data.frame", row.names = c(NA, -5L))
$ awk -F, 'NR>1{gsub(,0); gsub(/[ACTG]/,1)} 1' file
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,0,0,0,0,0,0,0,0,0,0,0,0,1,1
79115,0,0,0,0,0,0,0,1,0,0,1,0,0,0
79116,0,0,0,0,0,0,0,0,0,0,0,0,0,0
79124,0,0,1,1,0,0,0,0,0,0,0,0,0,0
79128,0,0,0,0,1,0,0,0,0,1,0,0,0,0
假设您想将“参考”等位基因编码为 0
,其中“参考”是最常见的,这里是使用 Python 3 的解决方案,其 pandas
数据处理库(使用pip3 install pandas
安装)和Counter
specialized dictionary.
从包含示例数据的文件 snps_letters.csv
开始:
$ cat snps_letters.csv
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,C,C,C,C,C,C,C,C,C,C,C,C,G,G
79115,C,C,C,C,C,C,C,A,C,C,T,C,C,C
79116,A,A,A,A,A,A,A,A,A,A,A,A,A,A
79124,C,C,T,T,C,C,C,C,C,C,C,C,C,C
79128,G,G,G,G,A,G,G,G,G,C,G,G,G,G
我们将打开它,转换它并保存在脚本中binarize.py
:
#!/usr/bin/env python3
from collections import Counter
import pandas as pd
# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv("snps_letters.csv", index_col="pos")
def binarize(column):
# Extracting the most common element as ref
# (using list and tuple unpacking,
# because the most_common method returns
# a list of (element, count) pairs):
[(ref, _)] = Counter(column).most_common(1)
# We can now define an auxiliary function
# to recode individual elements in the column:
def to_bin(elem):
# int(True) -> 1, int(False) -> 0
return 1 - int(elem == ref)
# Transform the column by applying the recoding function to its elements
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
return column.apply(to_bin)
# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize, axis=0)
# Write the result to a file:
binarized.to_csv("snps_binary.csv")
正在执行 shell 中的脚本:
$ ./binarize.py
检查保存的结果:
$ cat snps_binary.csv
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,0,0,0,0,0,0,0,0,0,0,0,0,0,0
79115,0,0,0,0,0,0,0,1,0,0,1,0,1,1
79116,1,1,1,1,1,1,1,1,1,1,1,1,1,1
79124,0,0,1,1,0,0,0,0,0,0,0,0,1,1
79128,1,1,1,1,1,1,1,1,1,0,1,1,0,0
要获得更多 re-usable 脚本,您可以将输入和输出文件路径作为参数传递给命令行,它们可以在 python 代码中作为 sys.argv[1]
和 sys.argv[2]
(前提是你先 import sys
)。
在 command-line 上获取输入和输出文件并将参考核苷酸编码为 1 而不是 0 的版本:
#!/usr/bin/env python3
import sys
from collections import Counter
import pandas as pd
# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv(sys.argv[1], index_col="pos")
def binarize(column):
# Extracting the most common element as ref
# (using list and tuple unpacking,
# because the most_common method returns
# a list of (element, count) pairs):
[(ref, _)] = Counter(column).most_common(1)
# We can now define an auxiliary function
# to recode individual elements in the column:
def to_bin(elem):
# int(True) -> 1, int(False) -> 0
return int(elem == ref)
# Transform the column by applying the recoding function to its elements
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
return column.apply(to_bin)
# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize, axis=0)
# Write the result to a file:
binarized.to_csv(sys.argv[2])
它将按如下方式使用:
./binarize.py snps_letters.csv snps_binary.csv
识别参考等位基因,在这种情况下,参考基于最常见的等位基因,使用 ,我们有 5 个 SNP REF 等位基因:
REF <- apply(df[, -1], 1, function(i) names(which.max(table(i))))
# [1] "C" "C" "A" "C" "G"
现在,逐列比较每个样本与 REF,然后通过将逻辑 TRUE/FALSE 乘以 1 转换为整数。并将其分配回原始 数据框:
df[, -1] <- (df[, -1] != REF) * 1
df
# pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 sample13 sample14
# 1 79107 0 0 0 0 0 0 0 0 0 0 0 0 1 1
# 2 79115 0 0 0 0 0 0 0 1 0 0 1 0 0 0
# 3 79116 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 4 79124 0 0 1 1 0 0 0 0 0 0 0 0 0 0
# 5 79128 0 0 0 0 1 0 0 0 0 1 0 0 0 0
我正在努力弄清楚如何转换如下所示的 (SNP) 文件:
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,C,C,C,C,C,C,C,C,C,C,C,C,G,G
79115,C,C,C,C,C,C,C,A,C,C,T,C,C,C
79116,A,A,A,A,A,A,A,A,A,A,A,A,A,A
79124,C,C,T,T,C,C,C,C,C,C,C,C,C,C
79128,G,G,G,G,A,G,G,G,G,C,G,G,G,G
像这样的二进制格式:
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,0,0,0,0,0,0,0,0,0,0,0,0,1,1
79115,0,0,0,0,0,0,0,1,0,0,1,0,0,0
79116,0,0,0,0,0,0,0,0,0,0,0,0,0,0
79124,0,0,1,1,0,0,0,0,0,0,0,0,0,0
79128,0,0,0,0,1,0,0,0,0,1,0,0,0,0
有人知道 R 中的包或命令行工具、awk 代码可用于此吗?
以下解决方案的一个问题是都这样做:
1097023,A,A,G,G,G,G,G,G,G,G,G,G,A,A
1097027,C,C,C,C,C,C,C,C,C,C,C,C,C,C
4363243,C,A,A,A,A,A,A,A,A,A,A,A,A,A
4363270,T,T,T,T,T,T,T,T,T,T,T,T,T,T
4363275,A,G,G,G,G,G,G,G,G,G,G,G,G,G
1097023,0,0,1,1,1,1,1,1,1,1,1,1,0,0
1097027,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4363243,0,1,1,1,1,1,1,1,1,1,1,1,1,1
4363270,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4363275,0,1,1,1,1,1,1,1,1,1,1,1,1,1
虽然应该是:
1097023,1,1,0,0,0,0,0,0,0,0,0,0,1,1
1097027,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4363243,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4363270,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4363275,1,0,0,0,0,0,0,0,0,0,0,0,0,0
所以我认为解决方案应该包括如果大多数样本 'i.e. > 7 samples' 相同,则变为 0,不符合多数的变为 1。
这是一个简单的awk
(标准Linuxawk/gawk
)脚本:
script.awk
BEGIN {FS = ","} # set field seperator to ","
NR>1{ # every line except the first line
zeroIndc=; # identify the zero indicator and save the variable
for (i = 2; i <= NF; i++) { # for each DNA letter
if ($i == zeroIndc) { # if DNA letter match zero indicator
$i = 0; # set DNA letter to 0
} else { # if DNA letter not match zero indicator
$i = 1; # set DNA letter to 1
}
}
print; # print new line at end
}
运行 脚本 script.awk
awk -f scirpt.awk input.txt
运行 单行 awk 脚本
awk 'BEGIN{FS=","}NR>1{z=;for(i=2;i<=NF;i++)$i=($i==z)?0:1;print;}' input.txt
使用 base R,您可以将每个 sample
列与第一个样本列进行比较,并在值不匹配的地方将 return 转为 1。
cols <- grep('sample', names(df))
df[cols] <- +(df$sample1 != df[cols])
df
# pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9
#1 79107 0 0 0 0 0 0 0 0 0
#2 79115 0 0 0 0 0 0 0 1 0
#3 79116 0 0 0 0 0 0 0 0 0
#4 79124 0 0 1 1 0 0 0 0 0
#5 79128 0 0 0 0 1 0 0 0 0
# sample10 sample11 sample12 sample13 sample14
#1 0 0 0 1 1
#2 0 1 0 0 0
#3 0 0 0 0 0
#4 0 0 0 0 0
#5 1 0 0 0 0
虽然上面的方法在大型数据集上会更有效,但这里有一个 dplyr
库的替代方法。
library(dplyr)
df <- df %>% mutate(across(contains('sample'), ~+(sample1 != .)))
数据
df <- structure(list(pos = c(79107L, 79115L, 79116L, 79124L, 79128L
), sample1 = c("C", "C", "A", "C", "G"), sample2 = c("C", "C",
"A", "C", "G"), sample3 = c("C", "C", "A", "T", "G"), sample4 = c("C",
"C", "A", "T", "G"), sample5 = c("C", "C", "A", "C", "A"), sample6 = c("C",
"C", "A", "C", "G"), sample7 = c("C", "C", "A", "C", "G"), sample8 = c("C",
"A", "A", "C", "G"), sample9 = c("C", "C", "A", "C", "G"), sample10 = c("C",
"C", "A", "C", "C"), sample11 = c("C", "T", "A", "C", "G"), sample12 = c("C",
"C", "A", "C", "G"), sample13 = c("G", "C", "A", "C", "G"), sample14 = c("G",
"C", "A", "C", "G")), class = "data.frame", row.names = c(NA, -5L))
$ awk -F, 'NR>1{gsub(,0); gsub(/[ACTG]/,1)} 1' file
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,0,0,0,0,0,0,0,0,0,0,0,0,1,1
79115,0,0,0,0,0,0,0,1,0,0,1,0,0,0
79116,0,0,0,0,0,0,0,0,0,0,0,0,0,0
79124,0,0,1,1,0,0,0,0,0,0,0,0,0,0
79128,0,0,0,0,1,0,0,0,0,1,0,0,0,0
假设您想将“参考”等位基因编码为 0
,其中“参考”是最常见的,这里是使用 Python 3 的解决方案,其 pandas
数据处理库(使用pip3 install pandas
安装)和Counter
specialized dictionary.
从包含示例数据的文件 snps_letters.csv
开始:
$ cat snps_letters.csv
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,C,C,C,C,C,C,C,C,C,C,C,C,G,G
79115,C,C,C,C,C,C,C,A,C,C,T,C,C,C
79116,A,A,A,A,A,A,A,A,A,A,A,A,A,A
79124,C,C,T,T,C,C,C,C,C,C,C,C,C,C
79128,G,G,G,G,A,G,G,G,G,C,G,G,G,G
我们将打开它,转换它并保存在脚本中binarize.py
:
#!/usr/bin/env python3
from collections import Counter
import pandas as pd
# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv("snps_letters.csv", index_col="pos")
def binarize(column):
# Extracting the most common element as ref
# (using list and tuple unpacking,
# because the most_common method returns
# a list of (element, count) pairs):
[(ref, _)] = Counter(column).most_common(1)
# We can now define an auxiliary function
# to recode individual elements in the column:
def to_bin(elem):
# int(True) -> 1, int(False) -> 0
return 1 - int(elem == ref)
# Transform the column by applying the recoding function to its elements
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
return column.apply(to_bin)
# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize, axis=0)
# Write the result to a file:
binarized.to_csv("snps_binary.csv")
正在执行 shell 中的脚本:
$ ./binarize.py
检查保存的结果:
$ cat snps_binary.csv
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,0,0,0,0,0,0,0,0,0,0,0,0,0,0
79115,0,0,0,0,0,0,0,1,0,0,1,0,1,1
79116,1,1,1,1,1,1,1,1,1,1,1,1,1,1
79124,0,0,1,1,0,0,0,0,0,0,0,0,1,1
79128,1,1,1,1,1,1,1,1,1,0,1,1,0,0
要获得更多 re-usable 脚本,您可以将输入和输出文件路径作为参数传递给命令行,它们可以在 python 代码中作为 sys.argv[1]
和 sys.argv[2]
(前提是你先 import sys
)。
在 command-line 上获取输入和输出文件并将参考核苷酸编码为 1 而不是 0 的版本:
#!/usr/bin/env python3
import sys
from collections import Counter
import pandas as pd
# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv(sys.argv[1], index_col="pos")
def binarize(column):
# Extracting the most common element as ref
# (using list and tuple unpacking,
# because the most_common method returns
# a list of (element, count) pairs):
[(ref, _)] = Counter(column).most_common(1)
# We can now define an auxiliary function
# to recode individual elements in the column:
def to_bin(elem):
# int(True) -> 1, int(False) -> 0
return int(elem == ref)
# Transform the column by applying the recoding function to its elements
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
return column.apply(to_bin)
# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize, axis=0)
# Write the result to a file:
binarized.to_csv(sys.argv[2])
它将按如下方式使用:
./binarize.py snps_letters.csv snps_binary.csv
识别参考等位基因,在这种情况下,参考基于最常见的等位基因,使用
REF <- apply(df[, -1], 1, function(i) names(which.max(table(i))))
# [1] "C" "C" "A" "C" "G"
现在,逐列比较每个样本与 REF,然后通过将逻辑 TRUE/FALSE 乘以 1 转换为整数。并将其分配回原始 数据框:
df[, -1] <- (df[, -1] != REF) * 1
df
# pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 sample13 sample14
# 1 79107 0 0 0 0 0 0 0 0 0 0 0 0 1 1
# 2 79115 0 0 0 0 0 0 0 1 0 0 1 0 0 0
# 3 79116 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 4 79124 0 0 1 1 0 0 0 0 0 0 0 0 0 0
# 5 79128 0 0 0 0 1 0 0 0 0 1 0 0 0 0