How to keep repetitive punctuation while parsing a Python string?

Question

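I need to process a small amount of text (i.e. strings in Python).

I want to remove certain punctuation marks (such as '.', ',', ':', ';')

but keep punctuation that conveys emotion, e.g. ('...', '?', '??', '???', '!', '!!', '!!!').

I also want to remove uninformative words such as 'a', 'an', 'the'. The biggest challenge so far is how to parse "I've" or "we've" so that I end up with "I have" and "we have". The apostrophe is giving me trouble.

What is the best/simplest way to do this in Python?

For example: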

"I've got an A mark!!! Such a relief... I should've partied more."

The result I would like to get:

['I', 'have', 'got', 'A', 'mark', '!!!', 'Such', 'relief', '...',
 'I', 'should', 'have', 'partied', 'more']

Answer 1

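This can get complicated, depending on how many rules you want to apply.

You can use \b in a regular expression to match the beginning or end of a word. With that, you can also isolate punctuation and check whether it is a single character from a list such as [.;:].

These ideas are used in this code: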
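As a small illustration of the word-boundary idea on its own, the sketch below pads every \b position with a space and then splits on whitespace. The variable names and the sample string are only for illustration.

```python
import re

sample = "mark!!! Such a relief..."

# Insert a space at every word boundary. \b never matches between two
# non-word characters, so runs such as "!!!" and "..." stay in one piece.
spaced = re.sub(r'\b', ' ', sample)

# Splitting on whitespace now yields words and punctuation runs
# as separate tokens.
tokens = spaced.split()
print(tokens)  # ['mark', '!!!', 'Such', 'a', 'relief', '...']
```

Note that the article 'a' is still present at this stage; filtering is a separate step.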

import re

def tokenise(txt):
    # Expand the "'ve" contraction, keeping the letter before the apostrophe
    txt = re.sub(r"(?i)(\w)'ve\b", r'\1 have', txt)
    # Separate punctuation from words by padding every word boundary
    txt = re.sub(r'\b', ' ', txt)
    # Remove isolated, single-character punctuation and articles
    # (a, an, the), keeping the surrounding whitespace
    txt = re.sub(r'(^|\s)([.;:]|[Aa]n|a|[Tt]he)($|\s)', r'\1\3', txt)
    # Split into non-empty strings; a list works on Python 2 and 3
    return [tok for tok in re.split(r'\s+', txt) if tok]

# Example use
txt = "I've got an A mark!!! Such a relief... I should've partied more."
words = tokenise(txt)
print(','.join(words))

Output:

I,have,got,A,mark,!!!,Such,relief,...,I,should,have,partied,more

See it run on eval.in.
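If the rule set grows, one possible alternative is to tokenise in a single pass with re.findall and then filter the tokens with lookup tables. This is only a sketch, not the method used above: the names CONTRACTIONS, STOP_WORDS, DROP_PUNCT, TOKEN_RE and tokenise_findall are hypothetical, and the extra contractions ("'ll", "'re") are assumptions added for illustration. It also drops ',' as requested in the question, and it keeps a capital 'A' just as the code above does.

```python
import re

# Hypothetical lookup tables; extend them to suit your own rules.
CONTRACTIONS = {"'ve": "have", "'ll": "will", "'re": "are"}
STOP_WORDS = {"a", "an", "An", "the", "The"}  # a capital "A" is kept
DROP_PUNCT = {".", ",", ":", ";"}

# A token is an apostrophe suffix ('ve), a word, or a run of punctuation.
TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]+")

def tokenise_findall(txt):
    tokens = []
    for tok in TOKEN_RE.findall(txt):
        # Skip single unwanted punctuation marks and articles;
        # runs such as "..." or "!!!" are not in DROP_PUNCT and survive.
        if tok in DROP_PUNCT or tok in STOP_WORDS:
            continue
        # Replace a known contraction suffix, otherwise keep the token.
        tokens.append(CONTRACTIONS.get(tok.lower(), tok))
    return tokens

txt = "I've got an A mark!!! Such a relief... I should've partied more."
print(tokenise_findall(txt))
```

One limitation of this sketch is that a word in single quotes, such as 'hello', would be split as an apostrophe suffix, so quoted text would need an additional rule.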

regex

parsing

text

punctuation

python-2.7