How to use grep to match exact words
I have a list/array in zsh, defined as house=$(cat corrected_inhouse_list.txt),
which contains:
N-METHYL-L-GLUTAMIC ACID
L-GLUTAMIC ACID
CREATINE
L-PROLINE
CREATINE PHOSPHATE
L-VALINE
L-TYROSINE
L-KYNURENINE
L-PHENYLALANINE
PHENYLETHANOLAMINE
D-PANTOTHENIC ACID
L-TRYPTOPHAN
MYRISTIC ACID
The file "metexplore_IDs_DB.tsv":
8:M_Lkynr exact multimatching 1 L-KYNURENINE CHEBI:16946 NA NA
21:M_glu_L exact multimatching 1 L-GLUTAMIC ACID CHEBI:16015 NA NA
40:M_trp_L exact multimatching 1 L-TRYPTOPHAN CHEBI:16828 NA NA
42:M_pro_L exact multimatching 1 L-PROLINE CHEBI:17203 NA NA
50:M_phe_L exact multimatching 1 L-PHENYLALANINE CHEBI:17295 NA NA
56:M_creat exact multimatching 1 CREATINE CHEBI:16919 NA NA
57:M_34dhphe exact multimatching 1 3,4-DIHYDROXY-L-PHENYLALANINE (L-DOPA) CHEBI:15765 NA NA
61:M_tyr_L exact multimatching 1 L-TYROSINE CHEBI:17895 NA NA
63:M_val_L exact multimatching 1 L-VALINE CHEBI:16414 NA NA
94:M_Lkynr exact multimatching 1 L-KYNURENINE CHEBI:16946 NA NA
95:M_5oxpro exact multimatching 1 5-OXO-L-PROLINE CHEBI:18183 NA NA
107:M_4hpro_LT exact multimatching 1 4-HYDROXY-L-PROLINE CHEBI:18095 NANA
171:M_pcreat exact multimatching 1 PHOSPHOCREATINE CHEBI:17287 NA NA
191:M_pnto_R exact multimatching 1 D-PANTOTHENIC ACID CHEBI:7916 NANA
211:M_pcreat exact multimatching 1 CREATINE PHOSPHATE CHEBI:17287 NANA
237:M_35diotyr exact multimatching 1 3,5-DIIODO-L-TYROSINE CHEBI:15768 NANA
315:M_ttdca exact multimatching 1 MYRISTIC ACID CHEBI:28875
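I want to use grep to match these words in the file. The problem, visible in the listing above, is that grep also captures words that contain, but do not start with, the word I am interested in.
I tried: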
for i in ${house[*]}; do grep -n -E "^\s*\{$i}\>" metexplore_IDs_DB.tsv; done
for i in ${house[*]}; do grep -n -E -w "\<$i" metexplore_IDs_DB.tsv; done
for i in ${house[*]}; do grep -n -E "(^|\t)$i" metexplore_IDs_DB.tsv; done
What can I do to achieve my goal? The desired output would not include lines 57, 95, 107 and 237.
It looks like you are always matching against field 4, so awk would be a better solution, because you can simply do an exact match on the whole field:
for i in "${house[@]}"; do
  # print rows whose 4th tab-separated field equals the name exactly
  awk -F'\t' -v i="$i" '$4 == i' metexplore_IDs_DB.tsv
done
Don't forget the quotes around "${house[@]}"; otherwise an element such as L-GLUTAMIC ACID would be treated as two separate words to match.
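As a minimal sketch of the same exact-field idea, the loop below reads the names line by line instead of relying on a shell array, so it should behave the same in zsh, bash, or plain sh. The sample files and their contents are made up for illustration (only a few rows modeled on the question's data), and the variable names `tmp`, `name`, and `result` are assumptions, not part of the original post.

```shell
#!/bin/sh
# Hypothetical sample data standing in for corrected_inhouse_list.txt
# and metexplore_IDs_DB.tsv (tab-separated, name in field 4).
tmp=$(mktemp -d)
printf '%s\n' 'L-PROLINE' 'CREATINE PHOSPHATE' > "$tmp/list.txt"
printf 'M_pro_L\texact multimatching\t1\tL-PROLINE\tCHEBI:17203\tNA\tNA\n'        >  "$tmp/db.tsv"
printf 'M_5oxpro\texact multimatching\t1\t5-OXO-L-PROLINE\tCHEBI:18183\tNA\tNA\n' >> "$tmp/db.tsv"
printf 'M_pcreat\texact multimatching\t1\tCREATINE PHOSPHATE\tCHEBI:17287\tNA\tNA\n' >> "$tmp/db.tsv"

# Read one name per line; IFS= and -r make read keep the line as-is,
# so a name with a space stays a single value.
result=$(
  while IFS= read -r name; do
    # Print "line number:row" when field 4 is exactly the name.
    awk -F'\t' -v name="$name" '$4 == name { print FNR ":" $0 }' "$tmp/db.tsv"
  done < "$tmp/list.txt"
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample data the 5-OXO-L-PROLINE row is skipped, because its fourth field only contains L-PROLINE rather than being equal to it.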
You can also avoid creating the array and the loop altogether by loading corrected_inhouse_list.txt directly into an awk array:
awk -F'\t' '
  NR == FNR {houses[$1]++; next}   # first file: store each name as an array key
  $4 in houses' corrected_inhouse_list.txt metexplore_IDs_DB.tsv
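If grep itself is preferred, as in the question title, one possible sketch is to wrap every name in literal tab characters and pass the result to grep as fixed strings. This assumes the name is never the first or last field on a line, which holds for the rows shown in the question. The sample rows and the file name `patterns.txt` are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data modeled on the question's files.
tmp=$(mktemp -d)
printf '%s\n' 'L-PROLINE' 'CREATINE' > "$tmp/list.txt"
printf 'M_pro_L\texact multimatching\t1\tL-PROLINE\tCHEBI:17203\tNA\tNA\n'          >  "$tmp/db.tsv"
printf 'M_5oxpro\texact multimatching\t1\t5-OXO-L-PROLINE\tCHEBI:18183\tNA\tNA\n'   >> "$tmp/db.tsv"
printf 'M_creat\texact multimatching\t1\tCREATINE\tCHEBI:16919\tNA\tNA\n'           >> "$tmp/db.tsv"
printf 'M_pcreat\texact multimatching\t1\tPHOSPHOCREATINE\tCHEBI:17287\tNA\tNA\n'   >> "$tmp/db.tsv"
printf 'M_pcreat\texact multimatching\t1\tCREATINE PHOSPHATE\tCHEBI:17287\tNA\tNA\n' >> "$tmp/db.tsv"

# Build one pattern per name: TAB name TAB, so the name must fill a whole field.
while IFS= read -r name; do
  if [ -n "$name" ]; then          # an empty pattern would match every line
    printf '\t%s\t\n' "$name"
  fi
done < "$tmp/list.txt" > "$tmp/patterns.txt"

# -F: patterns are fixed strings, -f: read them from a file, -n: show line numbers.
result=$(grep -n -F -f "$tmp/patterns.txt" "$tmp/db.tsv")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the patterns are fixed strings, characters such as parentheses or dots in a name are matched literally. The trade-off compared with the awk version is that the match is not tied to field 4 specifically, only to some field surrounded by tabs.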
You may consider this awk, which builds a regular expression for each entry in the list and then searches for that regex anywhere in the csv file:
awk '
NR == FNR {
  # first file: store a regex requiring the name to be bounded by blanks or line ends
  kw[ "(^|[[:blank:]])" $0 "([[:blank:]]|$)" ]
  next
}
{
  for (w in kw)
    if ( $0 ~ w ) {
      print
      next
    }
}' corrected_inhouse_list.txt metexplore_IDs_DB.tsv
8:M_Lkynr exact multimatching 1 L-KYNURENINE CHEBI:16946 NA NA
21:M_glu_L exact multimatching 1 L-GLUTAMIC ACID CHEBI:16015 NA NA
40:M_trp_L exact multimatching 1 L-TRYPTOPHAN CHEBI:16828 NA NA
42:M_pro_L exact multimatching 1 L-PROLINE CHEBI:17203 NA NA
50:M_phe_L exact multimatching 1 L-PHENYLALANINE CHEBI:17295 NA NA
56:M_creat exact multimatching 1 CREATINE CHEBI:16919 NA NA
61:M_tyr_L exact multimatching 1 L-TYROSINE CHEBI:17895 NA NA
63:M_val_L exact multimatching 1 L-VALINE CHEBI:16414 NA NA
94:M_Lkynr exact multimatching 1 L-KYNURENINE CHEBI:16946 NA NA
191:M_pnto_R exact multimatching 1 D-PANTOTHENIC ACID CHEBI:7916 NANA
211:M_pcreat exact multimatching 1 CREATINE PHOSPHATE CHEBI:17287 NANA
315:M_ttdca exact multimatching 1 MYRISTIC ACID CHEBI:28875
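The regex-based search treats every list entry as a pattern, so a name containing regex metacharacters (parentheses, for example) would not be matched literally, and [[:blank:]] accepts spaces as well as tabs. A hedged alternative sketch, assuming the file really is tab-separated, compares each field literally against the list instead. The sample rows and the loop variable `f` are illustrative only.

```shell
#!/bin/sh
# Hypothetical sample data modeled on the question's files.
tmp=$(mktemp -d)
printf '%s\n' 'L-TYROSINE' 'MYRISTIC ACID' > "$tmp/list.txt"
printf 'M_tyr_L\texact multimatching\t1\tL-TYROSINE\tCHEBI:17895\tNA\tNA\n'              >  "$tmp/db.tsv"
printf 'M_35diotyr\texact multimatching\t1\t3,5-DIIODO-L-TYROSINE\tCHEBI:15768\tNA\tNA\n' >> "$tmp/db.tsv"
printf 'M_ttdca\texact multimatching\t1\tMYRISTIC ACID\tCHEBI:28875\n'                    >> "$tmp/db.tsv"

result=$(awk -F'\t' '
  NR == FNR { kw[$0]; next }       # first file: remember each name as a key
  {
    # second file: print the row if any whole field equals a stored name
    for (f = 1; f <= NF; f++)
      if ($f in kw) { print; next }
  }' "$tmp/list.txt" "$tmp/db.tsv")
printf '%s\n' "$result"
rm -rf "$tmp"
```

This keeps the "match anywhere on the line" behavior of the regex version while still requiring a whole-field match, so 3,5-DIIODO-L-TYROSINE is not reported for L-TYROSINE.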