How do I split a string on delimiters but exclude other strings?

I have this string, which I want to split on periods:

j = 'you can get it cheaper than .99. shop at amazon.com. hurry before prices go up.'

This is the result I want:

['you can get it cheaper than .99. ', 'shop at amazon.com.', ' hurry before prices go up.']

I split on every lowercase letter that is followed by a period, and on any digit that is followed by a period and a space.

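import re

x = []
# The capturing group makes re.split keep each matched delimiter in the result.
sentences = re.split(r'([a-z]\.|\d\.\s)', j)
sentence_endings = sentences[1::2]  # odd indices hold the captured delimiters
for position in range(len(sentences)):
    if sentences[position] in sentence_endings:
        # Rejoin each delimiter with the text that precedes it.
        x.append(sentences[position - 1] + sentences[position])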

Printing x gives me:

['you can get it cheaper than .99. ', 'shop at amazon.', 'com.', ' hurry before prices go up.']

I want "amazon.com" to stay in one string, so I tried to tell the regex to ignore ".com" with re.split(r'([a-z]\.|\d\.\s)[^.com]', j), but that did not give me the result I want. What is the best way to do this?

A non-regex option is to use nltk.sent_tokenize():

>>> import nltk
>>> j = 'you can get it cheaper than .99. shop at amazon.com. hurry before prices go up.'
>>> nltk.sent_tokenize(j)
['you can get it cheaper than .99.', 'shop at amazon.com.', 'hurry before prices go up.']

A simple regex that splits on a period followed by a space would be: \.\s
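
As a minimal sketch of that pattern, assuming the string j from the question, the split could look like this. Note that the matched period and space are consumed by the split, so every piece except the last loses its trailing period.

```python
import re

j = 'you can get it cheaper than .99. shop at amazon.com. hurry before prices go up.'

# Split wherever a literal period is followed by one whitespace character.
# The delimiter itself (period + space) is dropped from the pieces.
parts = re.split(r'\.\s', j)
print(parts)
# ['you can get it cheaper than .99', 'shop at amazon.com', 'hurry before prices go up.']
```

The period inside "amazon.com" is left alone because it is followed by a letter rather than whitespace.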

You can use a lookbehind to keep the period in each piece of the split: (?<=\.)\s
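
A short sketch of the lookbehind pattern, again assuming the question's string j. The lookbehind only checks that a period precedes the whitespace, so the whitespace is the only thing removed.

```python
import re

j = 'you can get it cheaper than .99. shop at amazon.com. hurry before prices go up.'

# (?<=\.) is a zero-width lookbehind: it requires a period just before the
# whitespace but does not consume it, so each piece keeps its final period.
sentences = re.split(r'(?<=\.)\s', j)
print(sentences)
# ['you can get it cheaper than .99.', 'shop at amazon.com.', 'hurry before prices go up.']
```

This differs from the output requested in the question only in the surrounding whitespace, which the split removes.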

If you want to use the split method to get "amazon.com" out of your string, you can try: .*(?=amazon.com)|(?<=amazon.com).*