Pull string from a vcf file using awk
I am running the following code to manipulate numeric data in a vcf table.
cat inputfile | while read row; do
    echo $row > tmp
    originalProb= `awk '{print $1}' tmp`
    probabilityHom1=`awk '{print $2}' tmp`
    probabilityHom2=`awk '{print $4}' tmp`
    numCols=`awk '{print NF}' tmp`
    if [ $numCols -gt 4 ]; then
        echo "${originalProb}" >> currentRowGenotypes
    elif [ "$probabilityHom1" -gt "$probabilityHom2" ]; then
        echo "1/1" >> currentRowGenotypes
    elif [ "$probabilityHom1" -lt "$probabilityHom2" ]; then
        echo "0/0" >> currentRowGenotypes
    elif [ "$probabilityHom1" -eq "$probabilityHom2" ] && [ "$probabilityHom1" -eq 0 ]; then
        echo "${originalProb}" >> currentRowGenotypes
    else
        echo "het" >> currentRowGenotypes
    fi
done
cat tmpHeaders currentRowGenotypes > currentFullCol
The input file looks like this:
1/1 255 231 0
0/1 255 0 152
0/1 255 0 82
0/1 255 0 151
0/1 239 0 31
0/1 255 0 255
For some reason, the awk command does not recognize the first column. Any suggestions?
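Creating a temporary file just so that awk can split the line into columns is not a good idea, because:
- Creating a temporary file for every line adds overhead.
- It spawns a subprocess for each awk call, several times per line.
Besides, the syntax differences between bash and awk may be the cause of the error.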
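One concrete bash rule that may explain the symptom: an assignment must not have whitespace around `=`. The question's line `originalProb= ...` has a space after `=`, so bash sets `originalProb` to an empty string only in the environment of a command, and then tries to run the awk output (for example `1/1`) as that command. A minimal sketch of the difference, using a hypothetical sample row and the hypothetical variable names `good` and `bad`:

```shell
#!/usr/bin/env bash
# Hypothetical sample row, in the same layout as the input file above.
line='1/1 255 231 0'

# Correct: no whitespace around '='; the first field is captured.
good=$(printf '%s\n' "$line" | awk '{print $1}')

# Broken: the space after '=' makes bash treat the awk output ("1/1")
# as a command to run, with bad="" set only for that command.
# The command fails, and 'bad' is never set in the shell itself.
bad= $(printf '%s\n' "$line" | awk '{print $1}') 2>/dev/null || true

echo "good=[$good] bad=[${bad-unset}]"
```

If this is indeed the cause, removing the space after `=` should be enough to make the original script read the first column.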
You can do this without awk. Try the following:
while read -ra row; do
    # read -a splits the line on whitespace into the array "row"
    originalProb="${row[0]}"
    probabilityHom1="${row[1]}"
    probabilityHom2="${row[3]}"
    # number of fields; ${#row} would be the length of row[0] instead
    numCols="${#row[@]}"
    if (( numCols > 4 )); then
        echo "$originalProb" >> currentRowGenotypes
    elif (( probabilityHom1 > probabilityHom2 )); then
        echo "1/1" >> currentRowGenotypes
    elif (( probabilityHom1 < probabilityHom2 )); then
        echo "0/0" >> currentRowGenotypes
    elif (( probabilityHom1 == probabilityHom2 && probabilityHom1 == 0 )); then
        echo "$originalProb" >> currentRowGenotypes
    else
        echo "het" >> currentRowGenotypes
    fi
done < inputfile
cat tmpHeaders currentRowGenotypes > currentFullCol
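A note on counting columns in a bash array: `${#row[@]}` gives the number of elements, whereas `${#row}` gives the string length of the first element. A small sketch with a hypothetical sample row (the names `first_len` and `num_cols` are only for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical sample row, split on whitespace into a bash array.
read -ra row <<< '0/1 255 0 152'

first_len=${#row}      # length of row[0], the string "0/1"
num_cols=${#row[@]}    # number of array elements

echo "first_len=$first_len num_cols=$num_cols"
# prints: first_len=3 num_cols=4
```

For the sample input both values happen to be at most 4, so the `numCols > 4` branch is never taken either way, but the two expressions diverge as soon as a row has more fields.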
As others have repeatedly suggested, a better approach is to write it in awk:
awk '{
    # fields used by the question: genotype, first and third likelihood
    originalProb = $1
    probabilityHom1 = $2
    probabilityHom2 = $4
    numCols = NF
    if ( numCols > 4 )
        print originalProb >> "currentRowGenotypes"
    else if ( probabilityHom1 > probabilityHom2 )
        print "1/1" >> "currentRowGenotypes"
    else if ( probabilityHom1 < probabilityHom2 )
        print "0/0" >> "currentRowGenotypes"
    else if ( probabilityHom1 == probabilityHom2 && probabilityHom1 == 0 )
        print originalProb >> "currentRowGenotypes"
    else
        print "het" >> "currentRowGenotypes"
}' inputfile
cat tmpHeaders currentRowGenotypes > currentFullCol
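As a possible variation, the awk program can print to standard output so that no intermediate file is appended to; with `>>`, results from an earlier run would otherwise accumulate in currentRowGenotypes. This is only a sketch: `call_genotypes` is a hypothetical helper name, and it assumes the same column layout (genotype in field 1, the two compared values in fields 2 and 4) as the code above.

```shell
#!/usr/bin/env bash
# call_genotypes is a hypothetical helper: it reads rows on stdin
# and prints one genotype per row on stdout.
call_genotypes() {
    awk '{
        if (NF > 4)        print $1      # more than 4 columns: keep original
        else if ($2 > $4)  print "1/1"
        else if ($2 < $4)  print "0/0"
        else if ($2 == 0)  print $1      # both values are 0: keep original
        else               print "het"   # equal and nonzero
    }'
}

# Usage, assuming tmpHeaders and inputfile exist as in the question.
if [ -f tmpHeaders ] && [ -f inputfile ]; then
    { cat tmpHeaders; call_genotypes < inputfile; } > currentFullCol
fi
```

The grouped commands write the headers and the genotypes into currentFullCol in one step, so rerunning the script overwrites the result instead of extending it.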
Hope this helps.
Why not use Pysam? It is well suited to parsing BCF/VCF.