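awk remove endings of words ending with patterns
I have a large dataset and I am trying to lemmatize one column ($14) with awk. That is, I need to remove 'ing', 'ed', or 's' from any word that ends with one of those patterns, so that 'asking', 'asked', and 'asks' all become 'ask'.
Suppose I have this dataset (the column I want to modify is $2):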
onething This is a string that is tested multiple times.
twoed I wanted to remove words ending with many patterns.
threes Reading books is good thing.
The expected output is:
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
I tried the following regex with awk, but it did not work.
awk -F'\t' '{gsub(/\(ing|ed|s\)\b/," ",$2); print}' file.txt
# this replaces some of the words ending with ing and ed, but not all; words ending with s stay the same (which I don't want)
Please help, I am new to awk and still exploring it.
Using GNU awk for gensub() and \> for word boundaries:
$ awk 'BEGIN{FS=OFS="\t"} {$2=gensub(/(ing|ed|s)\>/,"","g",$2)} 1' file
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
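As a side note on why the attempt in the question likely fell short: awk uses extended regular expressions, where grouping is written with plain ( and ), so \( and \) match literal parentheses. In addition, \b means backspace in awk; GNU awk spells the word boundary \y (or \< and \>). The small sketch below, using a made-up input line and a hypothetical helper named show_literal_parens, illustrates the first point: only "ed" and the literal "(ing" are replaced.

```shell
# Hypothetical helper for illustration: applies the escaped-parenthesis
# pattern from the question (without the \b) to every input line.
show_literal_parens() {
    # In an ERE, \( and \) are literal parentheses, so the alternatives are:
    #   "(ing"   |   "ed"   |   "s)"
    awk '{ gsub(/\(ing|ed|s\)/, "#"); print }'
}

printf 'tested (ing) words\n' | show_literal_parens
# prints: test# #) words
```

Dropping the backslashes in front of the parentheses, as the answer above does, turns them back into a group.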
Using any awk with gsub, you can do:
awk -F'\t' -v OFS="\t" '
{ gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
}1
' file
Example input file
$ cat file
onething This is a string that is tested multiple times.
twoed I wanted to remove words ending with many patterns.
threes Reading books is good thing.
four Just a normal sentence.
Example use/output
$ awk -F'\t' -v OFS="\t" '
> { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
> match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
> }1
> ' file
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
four Just a normal sentence.
(Note: the last line was added as an example of a sentence that is left unchanged.)
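The gsub pattern above only recognises a blank or a period after the suffix. A possible variation, sketched below under the assumption that words are separated by whitespace, splits the field into words, sets any trailing punctuation aside, and strips the suffix with an anchored sub(). The function name strip_suffixes and the col variable are made up for this illustration; it should run in any POSIX awk, but treat it as a sketch rather than a drop-in replacement.

```shell
# Hypothetical helper: strip "ing", "ed" or "s" from the end of every word
# in one tab-separated column (column 2 unless a number is passed).
strip_suffixes() {
    awk -v col="${1:-2}" 'BEGIN { FS = OFS = "\t" }
    {
        # Split on runs of whitespace; repeated blanks collapse to one space.
        n = split($col, w, " ")
        out = ""
        for (i = 1; i <= n; i++) {
            # Set trailing punctuation (anything non-alphanumeric) aside.
            tail = ""
            if (match(w[i], /[^A-Za-z0-9]+$/)) {
                tail = substr(w[i], RSTART)
                w[i] = substr(w[i], 1, RSTART - 1)
            }
            # Remove at most one suffix per word.
            sub(/(ing|ed|s)$/, "", w[i])
            out = out (i > 1 ? " " : "") w[i] tail
        }
        $col = out
        print
    }'
}

printf 'threes\tReading books is good thing.\n' | strip_suffixes
# prints: threes<TAB>Read book i good th.
```

For the real dataset from the question, the column number could be passed as an argument, for example `strip_suffixes 14 < file.txt`.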
If you use GNU awk, you were not far off:
$ awk -F'\t' -v OFS='\t' '{gsub(/(ing|ed|s)\>/,"",$2); print}' file.txt
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
Note that -v OFS='\t' makes the tab the output field separator as well.
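To see why OFS matters here: assigning to any field makes awk rebuild the record by joining the fields with OFS, which defaults to a single space. The sketch below uses two made-up helper names and a made-up input line to show the difference.

```shell
# Hypothetical helpers: both assign to $2 so that awk rebuilds $0.
rebuild_default_ofs() { awk -F'\t' '{ $2 = $2; print }'; }
rebuild_tab_ofs()     { awk -F'\t' -v OFS='\t' '{ $2 = $2; print }'; }

printf 'id\tsome text\n' | rebuild_default_ofs   # the tab becomes a space
printf 'id\tsome text\n' | rebuild_tab_ofs       # the tab is preserved
```

Without the OFS setting, the output would no longer be a valid tab-separated file after the first modified field.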
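If your awk only supports the older regular expressions without word boundaries (the default awk that ships with macOS, for example), things get more complicated. One option is to iterate with match and substr. Example: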
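# foo.awk
# prefix: comma-separated list of endings to strip, passed with -v
BEGIN {
    n = split(prefix, word, /,/)
    for (i = 1; i <= n; i++) {
        len[i] = length(word[i])
    }
}
{
    for (i = 1; i <= n; i++) {
        # an ending followed by a non-alphanumeric character
        re = word[i] "[^[:alnum:]]"
        while (m = match($2, re)) {
            if (m == 1) {
                $2 = substr($2, len[i]+1, length($2))
            } else {
                # keep what precedes the ending, then everything after it
                $2 = substr($2, 1, m-1) substr($2, m+len[i], length($2))
            }
        }
    }
    print
}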
Then:
$ awk -F'\t' -v OFS='\t' -v prefix="ing,ed,s" -f foo.awk file.txt
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.