How can I separate a string that contains whitespace between words and punctuation into sentences?
I have the following string:
string = "Mr . john bought greatsite . com for 1 . 5 million dollars , i . e . he paid a lot for it . Did he mind ? Steve jones jr . thinks he didn't . In any case , this isn't true ... Well , with a probability of . 9 it isn't . What a great site ! I really loved it !!! Did you ???"
I need to split it into sentences like this:
Mr . john bought greatsite . com for 1 . 5 million dollars , i . e . he paid a lot for it .
Did he mind ?
Steve jones jr . thinks he didn't .
In any case , this isn't true ...
Well , with a probability of . 9 it isn't .
What a great site !
I really loved it !!!
Did you ???
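and store them in a list of sentences.
I used the following code:
import re
# input_doc2 holds the text to split
sents = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s", input_doc2)
print(sents)
The output I get is: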
['mr .', 'smith bought cheapsite .', 'com for 1 .', '5 million dollars , i .', 'e .', 'he paid a lot for it .', 'did he mind ?', 'adam jones jr .', "thinks he didn't .", "in any case , this isn't true ...", 'well , with a probability of .', "9 it isn't .", 'what a great movie !', 'i loved it .', 'i loved it !!!', 'did you ???', 'i did .!?', 'not really it was bad !', '']
This is wrong, and I can't seem to find a way to fix it. Is there a way to solve this?
Thanks in advance.
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s(?=[A-Z])
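Try this. The added lookahead (?=[A-Z]) only allows a split when the next character is an uppercase letter. See the demo.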
https://regex101.com/r/sH8aR8/3
import re
sents = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s(?=[A-Z])", input_doc2)
print(sents)
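For reference, here is a minimal, self-contained sketch of the pattern above applied to the sample string from the question. The names `SENTENCE_BOUNDARY`, `split_sentences`, and `text` are made up for this illustration and are not from the original posts.

```python
import re

# Pattern from the answer: split on a whitespace character that follows
# '.', '?' or '!' and is followed by an uppercase letter, unless one of
# the two negative lookbehinds matches.
SENTENCE_BOUNDARY = re.compile(
    r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s(?=[A-Z])"
)


def split_sentences(text):
    """Split text at every position matched by SENTENCE_BOUNDARY."""
    return SENTENCE_BOUNDARY.split(text)


# Sample string from the question (hypothetical variable name).
text = (
    "Mr . john bought greatsite . com for 1 . 5 million dollars , i . e . "
    "he paid a lot for it . Did he mind ? Steve jones jr . thinks he didn't . "
    "In any case , this isn't true ... Well , with a probability of . 9 it isn't . "
    "What a great site ! I really loved it !!! Did you ???"
)

sents = split_sentences(text)
for sentence in sents:
    print(sentence)
```

On this input the split yields the eight sentences listed in the question. Note that the approach depends on each sentence starting with an uppercase letter: if the text has been lowercased beforehand, the lookahead never matches and the whole string comes back as a single item.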