How do I split a string on delimiters but exclude other strings?
I have this string, which I want to split on periods:
j = 'you can get it cheaper than .99. shop at amazon.com. hurry before prices go up.'
This is the result I want:
['you can get it cheaper than .99. ', 'shop at amazon.com.', ' hurry before prices go up.']
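I am splitting on every lowercase letter that is followed by a period, and on any digit that is followed by a period and a space.
import re

x = []
# The capturing group keeps each matched sentence ending in the result list
sentences = re.split(r'([a-z]\.|\d\.\s)', j)
sentence_endings = sentences[1::2]
for position in range(len(sentences)):
    if sentences[position] in sentence_endings:
        # Re-attach each ending to the text that precedes it
        x.append(sentences[position - 1] + sentences[position])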
Printing x gives me:
['you can get it cheaper than .99. ', 'shop at amazon.', 'com.', ' hurry before prices go up.']
I want "amazon.com" to stay in one string, so I tried to tell the regex to ignore ".com" with re.split(r'([a-z]\.|\d\.\s)[^.com]', j), but that did not give me the result I want. What is the best way to do this?
A non-regex option is to use nltk.sent_tokenize():
>>> import nltk
>>> j = 'you can get it cheaper than .99. shop at amazon.com. hurry before prices go up.'
>>> nltk.sent_tokenize(j)
['you can get it cheaper than .99.', 'shop at amazon.com.', 'hurry before prices go up.']
A simple regex that splits on a period followed by whitespace would be \.\s
You can use a lookbehind to keep the period in the split results: (?<=\.)\s
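To make the difference between the two patterns concrete, here is a minimal sketch using the string from the question. The variable names without_periods and with_periods are only illustrative. Note that neither variant reproduces the exact leading and trailing spaces shown in the question's desired output, since the whitespace is consumed by the split in both cases.

```python
import re

j = 'you can get it cheaper than .99. shop at amazon.com. hurry before prices go up.'

# Plain pattern: the period and the whitespace are both consumed by the
# match, so the sentence-ending periods are dropped from the pieces
# (except the final one, which has no whitespace after it).
without_periods = re.split(r'\.\s', j)

# Lookbehind pattern: only the whitespace is consumed; the period is
# merely asserted to precede it, so it stays attached to its sentence.
with_periods = re.split(r'(?<=\.)\s', j)

print(without_periods)
print(with_periods)
```

Because "amazon.com" has no whitespace after its inner period, both patterns leave it intact.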
If you want to use a split-based approach to get "amazon.com" out of your string, you could try .*(?=amazon.com)|(?<=amazon.com).*