awk: remove endings of words ending with patterns

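I have a large dataset and I am trying to lemmatize one column ($14) with awk: I need to remove 'ing', 'ed', or 's' from any word that ends with one of those patterns. So asked, asking, and asks would all become 'ask'.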

Let's say I have this dataset (the column I want to modify is $2):

onething      This is a string that is tested multiple times.
twoed         I wanted to remove words ending with many patterns.
threes        Reading books is good thing.

The expected output is then:

onething      Thi i a str that i test multiple time.
twoed         I want to remove word end with many pattern.
threes        Read book i good th.

I tried awk with the following regex, but it did not work.

awk -F'\t' '{gsub(/\(ing|ed|s\)\b/," ",$2); print}' file.txt

# this replaces some of the words ending in ing and ed, but not all; words ending in s stay the same (which I don't want)

Please help. I am new to awk and still exploring it.

Using GNU awk for gensub() and \> for word boundaries:

$ awk 'BEGIN{FS=OFS="\t"} {$2=gensub(/(ing|ed|s)\>/,"","g",$2)} 1' file
onething        Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.

Using any awk with gsub, you can do:

awk -F'\t' -v OFS="\t" '
    { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
      match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
    }1
' file

Example Input File

$ cat file
onething        This is a string that is tested multiple times.
twoed   I wanted to remove words ending with many patterns.
threes  Reading books is good thing.
four    Just a normal sentence.

Example Use/Output

$ awk -F'\t' -v OFS="\t" '
>     { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
>       match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
>     }1
> ' file
onething        Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.
four    Just a normal sentence.

(note: the last line was added as an example of a sentence that is left unchanged)
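The gsub above only recognizes a suffix that is followed by a period or a blank, and the real data uses column $14 rather than $2. As a rough sketch of one way to cover both points with any POSIX awk, the column can be split into words, trailing punctuation set aside, and the suffix removed with an end-of-string anchor. The helper name `strip_col` and the `col` variable are made up for this illustration, and the sketch assumes plain ASCII words separated by blanks (runs of blanks collapse to a single space):

```shell
# Hypothetical helper: strip ing/ed/s from every word of one tab-separated column.
strip_col() {
  awk -v col="$1" 'BEGIN { FS = OFS = "\t" }
  {
    n = split($col, w, " ")            # split the column on runs of blanks
    out = ""
    for (i = 1; i <= n; i++) {
      punct = ""
      # set aside trailing non-alphanumeric characters such as "," or "!"
      if (match(w[i], /[^A-Za-z0-9]+$/)) {
        punct = substr(w[i], RSTART)
        w[i] = substr(w[i], 1, RSTART - 1)
      }
      sub(/(ing|ed|s)$/, "", w[i])     # drop one suffix at the end of the word
      out = out (i > 1 ? " " : "") w[i] punct
    }
    $col = out
    print
  }'
}

out=$(printf 'id1\tShe asked, he asks; stop asking!\n' | strip_col 2)
printf '%s\n' "$out"
```

With the sample line this should print `She ask, he ask; stop ask!` in the second column. For the real dataset the call would be `strip_col 14 < file`.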

If you use GNU awk, you were not far off:

$ awk -F'\t' -v OFS='\t' '{gsub(/(ing|ed|s)\>/,"",$2); print}' file.txt
onething    Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.

Note the -v OFS='\t' to use the tab as the output field separator, too.
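For what it is worth, here is a small experiment that may help explain why the regex in the question misbehaves. awk uses extended regular expressions, where an escaped parenthesis is a literal character, and awk defines \b as a backspace rather than a word boundary (GNU awk offers \y, \< and \> instead). So the pattern appears to split into three unanchored branches: a literal "(ing", "ed" anywhere, and "s)" followed by a backspace. The input line below is invented for the demonstration:

```shell
# The regex from the question, with an empty replacement so the effect is easy to see.
out=$(printf 'x\twanted (ing needed\n' |
  awk -F'\t' -v OFS='\t' '{ gsub(/\(ing|ed|s\)\b/, "", $2); print }')
printf '%s\n' "$out"
```

The second column should come out as `want  ne`: every "ed" is removed, even inside "needed", and only the literal "(ing" loses its "ing". That is consistent with some words changing and words ending in s staying untouched.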

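But if your awk uses the older kind of regular expressions without word boundaries (like the default awk that comes with macOS), things are more complicated. One option is to use match and substr iteratively. Example: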

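# foo.awk
BEGIN {
  # prefix holds the comma-separated list of endings to remove
  n = split(prefix, word, /,/)
  for(i = 1; i <= n; i++) {
    len[i] = length(word[i])
  }
}
{
  for(i = 1; i <= n; i++) {
    # an ending counts only when followed by a non-alphanumeric character
    re = word[i] "[^[:alnum:]]"
    while(m = match($2, re)) {
      if(m == 1) {
        $2 = substr($2, len[i]+1, length($2))
      } else {
        # keep the text before the ending and everything after it
        $2 = substr($2, 1, m-1) substr($2, m+len[i], length($2))
      }
    }
  }
  print
}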

And then:

$ awk -F'\t' -v OFS='\t' -v prefix="ing,ed,s" -f foo.awk file.txt
onething    Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.
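One caveat with foo.awk: an ending is only removed when some non-alphanumeric character follows it, so a field that ends in a bare word (no final period) keeps its last suffix. A possible workaround, sketched below under the assumption of ASCII text, is to append a sentinel blank before matching and drop it afterwards. The variable names `s` and `res` and the sample line are illustrative only:

```shell
out=$(printf 'threes\tReading books is good thing\n' |
  awk 'BEGIN { FS = OFS = "\t" }
  {
    s = $2 " "                             # sentinel blank after the last word
    res = ""
    while (match(s, /(ing|ed|s)[^A-Za-z0-9]/)) {
      # keep the text before the suffix plus the one delimiter character after it
      res = res substr(s, 1, RSTART - 1) substr(s, RSTART + RLENGTH - 1, 1)
      s = substr(s, RSTART + RLENGTH)      # continue after the delimiter
    }
    res = res s
    $2 = substr(res, 1, length(res) - 1)   # drop the sentinel again
    print
  }')
printf '%s\n' "$out"
```

Here the second column should become `Read book i good th`, including the final word. Unlike foo.awk, this variant scans the field once from left to right instead of once per ending.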