Grep grammatical clauses
I'm trying to find a way to extract grammatical clauses from an ebook sample.
The input looks like this:
This is a test my friend, this is just a test; I'm going to do some shopping:`what do you need?`
Nothing, he said.
Desired output:
This is a test my friend
this is just a test
I'm going to do shopping
what do you need
Nothing
he said
Any ideas on how to achieve this?
Many thanks!
Pipe it through tr (as written, this splits on commas only):
cat input | tr ',' '\n'
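To cover the other delimiters in the sample as well, the same approach can be widened. This is a minimal sketch, assuming a tr that accepts the POSIX `[c*]` padding syntax in the second set (GNU tr does); `split_on_punct` is a hypothetical helper name, not something from the answer above.

```shell
# Hypothetical helper: turn every clause delimiter on stdin into a newline.
split_on_punct() {
  # '[\n*]' pads the second set with newlines to match the first set;
  # -s squeezes each run of resulting newlines into a single newline.
  tr -s ',;:`?.' '[\n*]'
}

printf '%s\n' 'Nothing, he said.' | split_on_punct
```

The space that followed each delimiter is still there at the start of the next line, so a trimming step is needed afterwards.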
You can use gnu-awk like this:
awk -v RS='[\n.,;:`?]+' -v ORS='\n' '{$1=$1} 1' file
This is a test my friend
this is just a test
I'm going to do some shopping
what do you need
Nothing
he said
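A regular expression as RS is an extension (gawk supports it; POSIX leaves multi-character RS unspecified), so other awks may not split records this way. The following is a sketch of the same idea using only `split()` with a regex separator, which POSIX awk does provide. `split_clauses_awk` is a hypothetical name, and dropping empty pieces is an assumption about what is wanted.

```shell
# Hypothetical helper: print one clause per line using POSIX awk features only.
split_clauses_awk() {
  awk '{
    # Break the line wherever one or more delimiters occur.
    n = split($0, parts, /[.,;:`?]+/)
    for (i = 1; i <= n; i++) {
      sub(/^[ \t]+/, "", parts[i])        # drop whitespace left after a delimiter
      if (parts[i] != "") print parts[i]  # skip empty pieces, e.g. after a final "."
    }
  }'
}

printf '%s\n' 'Nothing, he said.' | split_clauses_awk
```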
This is close:
grep -o '[[:alpha:][:space:]]\+' file
But it turns the apostrophe in "I'm" into a line break. Given the punctuation in your example, this works:
grep -o '[^,;:`?.]\+' file
This leaves the space that follows each punctuation mark in place. To remove it, pipe the output through:
| sed 's/^ //'
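Put together, the two steps might look like the sketch below. It reads from stdin rather than a file, and uses `-E` with `+` instead of `\+` on the assumption that this is slightly more portable; `-o` itself is not in POSIX but is available in GNU and BSD grep. `clauses` is a hypothetical function name.

```shell
# Hypothetical helper: one clause per line, leading space removed.
clauses() {
  # -o prints each maximal run of non-delimiter characters on its own line.
  grep -oE '[^,;:`?.]+' |
  # Remove the single space that followed the delimiter.
  sed 's/^ //'
}

printf '%s\n' 'This is a test my friend, this is just a test; I'\''m going to do some shopping:`what do you need?`' 'Nothing, he said.' | clauses
```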