Bash: Extract cells from output formatted as a table

Question

I am using TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) to extract nouns from a text. My problem is that the output is formatted like this:

word    pos     lemma

The     DT      the 
TreeTagger      NP      TreeTagger 
is      VBZ     be 
easy    JJ      easy 
to      TO      to 
use     VB      use

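Apparently there is no option to output only the nouns ("NP" and "NN"). Using bash, how can I get the cells of the first column for the rows whose second column is "NP" or "NN"?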

Answer 1

You can use awk for this:

awk '$2 ~ /^N[PN]$/{print $1}' file

TreeTagger

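The regular expression /^N[PN]$/ matches exactly NP or NN. It is tested against the second field ($2), and the first field ($1) is printed for every matching row.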

As @Cyrus correctly commented below, you can also use alternation in the regular expression:

awk '$2 ~ /^(NP|NN)$/ {print $1}' file
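
To try this without a TreeTagger installation, here is a minimal sketch that feeds the sample rows from the question into the same awk filter. The function name extract_nouns is a hypothetical helper, not part of TreeTagger, and the input is assumed to be whitespace-separated (tabs here) with the tag in the second column.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print column 1 of every row whose tag (column 2)
# is exactly NP or NN. Reads the named files, or stdin if none are given.
extract_nouns() {
  awk '$2 ~ /^(NP|NN)$/ { print $1 }' "$@"
}

# Sample rows copied from the question, tab-separated (assumed layout).
printf 'word\tpos\tlemma\nThe\tDT\tthe\nTreeTagger\tNP\tTreeTagger\nis\tVBZ\tbe\neasy\tJJ\teasy\nto\tTO\tto\nuse\tVB\tuse\n' |
  extract_nouns
# prints: TreeTagger
```

The header row needs no special handling because "pos" does not match the pattern. Since the pattern is anchored with ^ and $, other tags that merely start with NN or NP are not matched; drop the $ anchor if you want those as well.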
