Divide a sentence into words using regex

Question

I want to break a sentence into words using a regular expression. I am using this code:

import re
sentence='<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
sentence = re.split(r'\s|,|>|<|\[|\]:', sentence)

But I am not getting the output I expected.

The expected output is:

['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', 'tester-test.service: activation successfully.']

But what I get is:

['', '30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', '', 'tester-test.service:', 'activation', 'successfully.']

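I do want to split on whitespace, but the whitespace should be left alone (not split on) in the last long chunk, and I don't know how to do that. Any suggestions or help would be appreciated. Thanks in advance.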

Answer 1

You can use

import re
sentence='<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
chunks = sentence.split(': ', 1)
result = re.findall(r'[^][\s,<>]+', chunks[0])
result.append(chunks[1])
print(result)
# => ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', 'tester-test.service: activation successfully.']

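See the Python demo.

Here,

chunks = sentence.split(': ', 1) - splits the sentence into two chunks at the first ': ' substring
result = re.findall(r'[^][\s,<>]+', chunks[0]) - extracts from the first chunk all substrings made of one or more characters other than ], [, whitespace, ,, < and >
result.append(chunks[1]) - appends the second chunk to the result list.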
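Note that chunks[1] raises an IndexError when the input contains no ': ' at all. A minimal sketch of the same idea wrapped in a function, assuming such lines should simply be tokenized in full (the name split_log_line is hypothetical, and str.partition is swapped in for split so the missing-separator case is handled):

```python
import re

def split_log_line(line):
    """Tokenize the part before the first ': ' and keep the rest as one item."""
    # partition() returns (head, sep, tail); sep is '' when ': ' is absent
    head, sep, tail = line.partition(': ')
    # same character class as above: runs of anything except ] [ whitespace , < >
    tokens = re.findall(r'[^][\s,<>]+', head)
    if sep:
        tokens.append(tail)
    return tokens

sentence = '<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
print(split_log_line(sentence))
```

For the sample sentence this produces the same list as the snippet above; for a line without ': ' it returns only the tokens of the whole line.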

Answer 2

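Judging from the "expected output" of your example, once a character preceded by ': ' is encountered, that character and all characters after it (to the end of the string) are to be returned. I take that to be one of the rules.

This suggests to me that you want to return matches (rather than the result of a split), and that the regular expression to match should be a two-part alternation (that is, of the form ...|...), the first part being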

(?<=: ).+

This reads, "greedily match one or more characters, the first of which is preceded by a colon followed by a space". (?<=: ) is a positive lookbehind.
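To see this first alternative in isolation, here is a small sketch that applies it on its own to the sample sentence (the variable names are illustrative only):

```python
import re

sentence = '<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'

# The lookbehind asserts that ': ' sits immediately before the match
# without consuming it, so the match starts right after the first ': '.
m = re.search(r'(?<=: ).+', sentence)
tail = m.group() if m else None
print(tail)
# tester-test.service: activation successfully.
```

Because .+ is greedy and the string has no newline, the match runs to the end of the string, which is why the later ': ' inside the tail does not start a second match.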

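Before reaching the first character that is preceded by a colon and a space, we need to match strings made up of digits, letters, hyphens, and colons that are both preceded and followed by a digit. The required regular expression is therefore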

rgx = r'(?<=: ).+|(?:[\da-zA-Z-]|(?<=\d):(?=\d))+'

You can therefore write

sentence = "<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully."

re.findall(rgx, sentence)
  #=> ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd',
  #    '1', 'tester-test.service: activation successfully.']

See the Python demo and the Regex demo.
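The second alternative can also be tried on its own. The sketch below, with the hypothetical name token_rgx, runs it against just the head of the sample sentence to show that a colon is kept only when it sits between two digits:

```python
import re

# Second alternative only: digits, ASCII letters, hyphens,
# plus a colon that has a digit on both sides.
token_rgx = r'(?:[\da-zA-Z-]|(?<=\d):(?=\d))+'

head = '<30>Jan 11 11:45:50 test-tt systemd[1]:'
tokens = re.findall(token_rgx, head)
print(tokens)
# ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1']
```

The colons in 11:45:50 are kept because each has a digit on either side, while the colon after ] is skipped because the character before it is not a digit.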

The components of the regular expression are as follows.

(?<=: )        # the preceding string must be ': '
.+             # match one or more characters (greedily)
|              # or
(?:            # begin a non-capture group
  [\da-zA-Z-]  # match one character in the character class
  |            # or
  (?<=\d)      # the previous character must be a digit
  :            # match literal
  (?=\d)       # the next character must be a digit
)+             # end the non-capture group and execute one or more times

(?=\d) is a positive lookahead.
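The commented breakdown can be turned into runnable code with re.VERBOSE. This is only a sketch: in verbose mode unescaped whitespace in the pattern is ignored, so the literal space in the lookbehind is written as [ ] here, which is a change from the one-line pattern. The name RGX is illustrative.

```python
import re

RGX = re.compile(r"""
    (?<=:[ ])         # the preceding two characters must be ': '
    .+                # match one or more characters (greedily)
    |                 # or
    (?:               # begin a non-capture group
      [\da-zA-Z-]     # a digit, an ASCII letter, or a hyphen
      |               # or
      (?<=\d):(?=\d)  # a colon with a digit on each side
    )+                # repeat the group one or more times
""", re.VERBOSE)

sentence = "<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully."
result = RGX.findall(sentence)
print(result)
```

This should give the same list as the one-line rgx, with the comments living next to the pattern they describe.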

python

regex

parsing