How can I find all substrings that have this pattern: some_word.some_other_word with python?

Question

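I'm trying to clean up some very noisy user-generated web data. Some people don't put a space after the period that ends a sentence. For example,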

"Place order.Call us if you have any questions."

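I want to extract each sentence, but when I try to parse this with nltk, it fails to recognize that these are two separate sentences. I would like to use regular expressions in Python to find all patterns that contain "some_word.some_other_word" and all patterns that contain "some_word:some_other_word".

At the same time I want to avoid matching patterns like "U.S.A", so avoid just_a_character.just_another_character.

Thanks a lot for your help :)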

Answer 1

You can use

import re
test = "some_word.some_other_word"
# \D+ matches one or more non-digit characters on each side of a literal period
r = re.compile(r'(\D+)\.(\D+)')
print(r.match(test).groups())
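One caveat with the snippet above: `match` only tries the pattern at the start of the string and returns `None` when it fails, so calling `.groups()` directly raises an error on input that does not fit. A minimal sketch of a guarded version follows; the helper name `split_at_period` is made up for illustration.

```python
import re

# Same pattern as in the answer: non-digits, a literal period, non-digits.
PAIR = re.compile(r'(\D+)\.(\D+)')

def split_at_period(text):
    """Return the two captured groups, or None if the pattern does not match at the start."""
    m = PAIR.match(text)
    return m.groups() if m else None

print(split_at_period("some_word.some_other_word"))  # ('some_word', 'some_other_word')
print(split_at_period("no period here"))             # None
```

Because `\D` excludes digits, a string such as "1.3" also yields `None` here.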

Answer 2

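The simplest solution:

>>> import re
>>> re.sub(r'([.:])([^\s])', r'\1 \2', 'This is a test. Yes, test.Hello:world.')
'This is a test. Yes, test. Hello: world.'

The first argument, the pattern, says we want to match a period or a colon followed by a non-whitespace character. The second argument is the replacement: it puts the first matched symbol, then a space, then the second matched symbol.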
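To see where this simple approach stops working, here is a small sketch that wraps the same pattern in a function (the name `add_space_after_punct` is hypothetical, and the replacement is assumed to be `\1 \2` as the explanation describes). It fixes the missing spaces, but it also splits abbreviations and dotted numbers, which the question wanted to avoid.

```python
import re

def add_space_after_punct(text):
    # Keep both captured characters and put a single space between them.
    return re.sub(r'([.:])([^\s])', r'\1 \2', text)

print(add_space_after_punct('test.Hello:world.'))  # test. Hello: world.
print(add_space_after_punct('U.S.A'))              # U. S. A
print(add_space_after_punct('1.3'))                # 1. 3
```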

Answer 3

It looks like you are asking two different questions:

1) If you want to find all patterns like "some_word.some_other_word" or "some_word:some_other_word":

import re
re.findall(r'\w+[.:?!]\w+', your_text)

This finds all such patterns in the text your_text.
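Note that `\w+` also matches single letters and digits, so the pattern above would report `U.S` and `1.3` as well. A possible stricter variant, sketched here under the assumption that a "word" means at least two ASCII letters, skips those cases. The name `WORD_PAIR` and the sample text are made up for illustration.

```python
import re

# Hypothetical stricter variant: at least two letters on each side of '.' or ':'.
WORD_PAIR = re.compile(r'\b[A-Za-z]{2,}[.:][A-Za-z]{2,}\b')

text = "Place order.Call us. Made in the U.S.A, section 1.3, note:see below."
print(WORD_PAIR.findall(text))  # ['order.Call', 'note:see']
```

The two-letter minimum is a judgment call: it avoids "U.S.A" but would also miss a legitimate one-letter word such as "a" or "I" next to the punctuation.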

2) If you want to extract all the sentences, you can do

import re
re.split(r'[.!?]', your_text)

This should return a list of sentences. For example,

import re
text = 'Hey, this is a test. How are you?Fine, thanks.'
re.findall(r'\w+[.:?!]\w+', text) # returns ['you?Fine']
re.split(r'[.!?]', text) # returns ['Hey, this is a test', ' How are you', 'Fine, thanks', '']
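The `re.split` call above drops the terminating punctuation and leaves an empty string at the end. If the punctuation should be kept, one option is to match sentences with `re.findall` instead of splitting. This is only a sketch with a hypothetical helper name, and like the split version it still breaks abbreviations such as "U.S.A" apart.

```python
import re

def split_sentences(text):
    # Each match is a run of non-terminators plus the terminators that end it.
    return [s.strip() for s in re.findall(r'[^.!?]+[.!?]*', text)]

text = 'Hey, this is a test. How are you?Fine, thanks.'
print(split_sentences(text))  # ['Hey, this is a test.', 'How are you?', 'Fine, thanks.']
```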

Answer 4

Here are some of the cases that might occur in your text:

sample = """
   Place order.Call us (period: split)  
   ever after.(The end) (period: split)  
   U.S.A.(abbreviation: don't split internally)
   1.3 How to work with computers (dotted numeral: don't split)  
   ever after...The end (ellipsis: don't split internally)
   (This is the end.)   (period inside parens: don't split)  
   """

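So: don't add a space after a digit, after a single capital letter, or before a paren or another period. Otherwise add a space. This does all of that:

import re
sample = re.sub(r"(\w[A-Z]|[a-z.])\.([^.)\s])", r"\1. \2", sample)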

Result:

Place order. Call us (period: split)  
ever after. (The end) (period: split)  
U.S.A.(abbreviation: don't split internally)
1.3 How to work with computers (dotted numeral: don't split)  
ever after... The end (ellipsis: don't split internally)
(This is the end.)   (period inside parens: don't split)

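This solves every problem in the sample except the last period after U.S.A., which should have a space added after it. I left that aside because the combination of conditions is tricky. The following regex handles everything, but I don't recommend it:

   sample = re.sub(r"(\w[A-Z]|[a-z.]|\b[A-Z](?!\.[A-Z]))\.([^.)\s])", r"\1. \2", sample)

Complex regexes like this are a maintainability nightmare: just try adding another pattern, or restricting it to ignore some more cases. Instead, I recommend using a separate regex to catch the missing case: a period after a single capital letter that is not followed by a single capital letter, a paren, or another period.

sample = re.sub(r"(\b[A-Z]\.)([^.)A-Z])", r"\1 \2", sample)

For a complex task like this, it makes sense to use a separate regex for each type of replacement. I would split the original regex into sub-cases too, each adding spaces only for one very specific pattern. You can have as many as you want, and it won't get out of hand (at least, not too much...)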
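The "one regex per case" idea can be sketched as an ordered list of rules applied in sequence. The two patterns are the ones from this answer, with the replacements assumed to be the backreferences `\1. \2` and `\1 \2`; the names `RULES` and `add_missing_spaces` are hypothetical.

```python
import re

# Each rule handles one narrow case; rules run in the order listed.
RULES = [
    # Period after a lowercase letter, another period, or word char + capital,
    # when the next character is not a period, ')' or whitespace.
    (re.compile(r"(\w[A-Z]|[a-z.])\.([^.)\s])"), r"\1. \2"),
    # Period after a single capital letter, when the next character is not
    # a period, ')' or another capital letter.
    (re.compile(r"(\b[A-Z]\.)([^.)A-Z])"), r"\1 \2"),
]

def add_missing_spaces(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(add_missing_spaces("Place order.Call us"))   # Place order. Call us
print(add_missing_spaces("U.S.A.(abbreviation)"))  # U.S.A. (abbreviation)
print(add_missing_spaces("1.3 How to work"))       # 1.3 How to work
```

Adding a new case then means appending one more `(pattern, replacement)` pair rather than editing an existing expression.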

python

regex

nltk