awk: remove endings of words ending with patterns

Question

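I have a huge dataset and I am trying to lemmatize one column ($14) with awk; that is, I need to remove 'ing', 'ed', or 's' from a word if it ends with one of those patterns. So asking, asked, and asks would all become just 'ask'.

Let's say I have this dataset (the column I want to modify is $2):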

onething      This is a string that is tested multiple times.
twoed         I wanted to remove words ending with many patterns.
threes        Reading books is good thing.

The expected output is:

onething      Thi i a str that i test multiple time.
twoed         I want to remove word end with many pattern.
threes        Read book i good th.

I tried the following regex with awk, but it does not work.

awk -F'\t' '{gsub(/\(ing|ed|s\)\b/," ",$2); print}' file.txt

# this replaces some of the words ending in ing and ed, but not all; words ending in s stay the same (which I don't want)

Please help. I am new to awk and still exploring it.

Answer 1

Using GNU awk for gensub() and \> for the end-of-word boundary:

$ awk 'BEGIN{FS=OFS="\t"} {$2=gensub(/(ing|ed|s)\>/,"","g",$2)} 1' file
onething        Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.
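
For context on why the attempt in the question misses most words: in awk's extended regular expressions, \( and \) match literal parentheses rather than forming a group, so the alternation splits into "(ing", "ed", and "s)" followed by \b. In GNU awk, \b is also a backspace character, not a word boundary (gawk uses \y, \< and \> for that). The sketch below only illustrates the literal-parenthesis part; it drops \b so that it behaves the same across awks, and the helper name and input line are made up for the demonstration.

```shell
# Hypothetical helper: runs a made-up line through the escaped-parenthesis
# pattern from the question (minus the \b part).
literal_parens_demo() {
  printf 'tested (ing s) times\n' |
    awk '{ gsub(/\(ing|ed|s\)/, "#"); print }'
}

literal_parens_demo
# prints: test# # # times
```

Only the literal "(ing" and "s)" and the bare "ed" are replaced; "times" is left alone, which matches the behavior described in the question.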

Answer 2

Using any awk with gsub, you can do:

awk -F'\t' -v OFS="\t" '
    { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
      match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
    }1
' file

Example Input File

$ cat file
onething        This is a string that is tested multiple times.
twoed   I wanted to remove words ending with many patterns.
threes  Reading books is good thing.
four    Just a normal sentence.

Example Use/Output

$ awk -F'\t' -v OFS="\t" '
>     { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
>       match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
>     }1
> ' file
onething        Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.
four    Just a normal sentence.

(note: the last line was added as an example of a sentence that is left unchanged)
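
To make the two steps of this answer easier to follow, here is a small trace on one line of the sample input. It is only a sketch: the helper name is made up, and it assumes words are separated by single spaces, so [[:blank:]] is written as a plain space.

```shell
# Hypothetical helper: prints field 2 after each of the two steps.
trace_answer2() {
  printf 'threes\tReading books is good thing.\n' |
    awk -F'\t' -v OFS='\t' '{
      gsub(/(s|ed|ing)[. ]/, " ", $2)          # suffix plus its delimiter -> one space
      print "after gsub: [" $2 "]"
      match($2, /[.]$/) || sub(/ $/, ".", $2)  # final period consumed? put it back
      print "after sub:  [" $2 "]"
    }'
}

trace_answer2
# prints:
# after gsub: [Read book i good th ]
# after sub:  [Read book i good th.]
```

The gsub consumes the character after the suffix, so a sentence ending in "ing." loses its period; the second statement restores it only when the field no longer ends in one.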

Answer 3

If you use GNU awk, you were not far from it:

$ awk -F'\t' -v OFS='\t' '{gsub(/(ing|ed|s)\>/,"",$2); print}' file.txt
onething    Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.

Note the -v OFS='\t', which makes the tab the output field separator too.
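
A quick way to see why OFS matters here: when a field is changed, awk rebuilds the record by joining the fields with OFS, which defaults to a single space. The sketch below uses a made-up input line and a hypothetical helper name.

```shell
# Hypothetical helper: the same field assignment with and without OFS set.
ofs_demo() {
  # Default OFS is a single space, so the tab is lost once field 2 is assigned.
  printf 'one\ttwo words\n' | awk -F'\t' '{ $2 = $2; print }'
  # With OFS set to a tab, the rebuilt record keeps its tab.
  printf 'one\ttwo words\n' | awk -F'\t' -v OFS='\t' '{ $2 = $2; print }'
}

ofs_demo
```

The first command prints the two fields separated by a space, the second prints them separated by a tab.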

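But if your awk uses the older regular expressions without word boundaries (like the default awk that ships with macOS), things are more complicated. One option is to use match and substr iteratively. Example: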

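# foo.awk
BEGIN {
  # prefix holds the comma-separated endings to strip (passed with -v)
  n = split(prefix, word, /,/)
  for(i = 1; i <= n; i++) {
    len[i] = length(word[i])
  }
}
{
  for(i = 1; i <= n; i++) {
    # an ending only counts when a non-alphanumeric character follows it
    re = word[i] "[^[:alnum:]]"
    while(m = match($2, re)) {
      if(m == 1) {
        # ending at the very start of the field: drop it
        $2 = substr($2, len[i]+1, length($2))
      } else {
        # keep the text before the ending and everything after it
        $2 = substr($2, 1, m-1) substr($2, m+len[i], length($2))
      }
    }
  }
  print
}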

Then:

$ awk -F'\t' -v OFS='\t' -v prefix="ing,ed,s" -f foo.awk file.txt
onething    Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.
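
Another portable option, sketched here rather than taken from the answers above, is to split the field into words and strip the ending from each word with an anchored sub(). The function name strip_suffixes is hypothetical, and the sketch assumes words are separated by spaces (runs of spaces collapse to one), that trailing punctuation should be kept, and that at most one ending is removed per word. It uses explicit character ranges instead of POSIX classes so that older awks accept it.

```shell
# Hypothetical helper: strips ing/ed/s from each word of tab-separated field 2.
strip_suffixes() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    n = split($2, w, " ")
    out = ""
    for (i = 1; i <= n; i++) {
      core = w[i]; tail = ""
      # Peel trailing non-alphanumeric characters (such as ".") off the word.
      while (core ~ /[^A-Za-z0-9]$/) {
        tail = substr(core, length(core)) tail
        core = substr(core, 1, length(core) - 1)
      }
      sub(/(ing|ed|s)$/, "", core)   # remove one ending, anchored at the word end
      out = out (i > 1 ? " " : "") core tail
    }
    $2 = out
    print
  }' "$@"
}

printf 'threes\tReading books is good thing.\n' | strip_suffixes
```

Because the ending is anchored with $ on a single word, no word-boundary operator is needed, and an ending at the very end of the field is handled as well.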
