Preprocess words that do not match list of words

Question

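I have a very specific case to match: I have some text and a list of words (which may contain numbers, underscores, or ampersands), and I want to strip numeric characters from the text (for instance) unless they are part of a word in my list. The list is also long enough that I can't just build a regex that matches every one of the words.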

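I've tried to do this with regex (i.e. something along the lines of re.sub(r'\d+', '', text)), while trying to come up with a more complex regex that covers my case. This obviously doesn't quite work, and I don't think regex is meant to handle this kind of case.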

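I'm experimenting with other options such as pyparsing, and tried something like the code below, but it also gives me an error (probably because I don't understand pyparsing correctly):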

from pyparsing import *
import re

phrases = ["76", "tw3nty", "potato_man", "d&"]
text = "there was once a potato_man with tw3nty cars and d& 76 different homes"
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(lambda word: re.sub(r'\d+', '', word)))
parser.parseString(text)

What is the best way to handle this sort of matching, or are there other, better-suited libraries worth trying?

Answer 1

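You are very close to getting this pyparsing cleaner-upper working.

Parse actions generally receive their matched tokens as a list-like structure, an instance of a pyparsing-defined class called ParseResults.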

You can see what actually gets sent to your parse action by wrapping it in the pyparsing decorator traceParseAction:

parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(traceParseAction(lambda word: re.sub(r'\d+', '', word))))

It is actually a little easier to read if you make your parse action a regular def'ed function instead of a lambda:

@traceParseAction
def unnumber(word):
    return re.sub(r'\d+', '', word)
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(unnumber))

traceParseAction reports what is passed to the parse action and what it returns:

>>entering unnumber(line: 'there was once a potato_man with tw3nty cars and d& 76 different homes', 0, ParseResults(['there'], {}))
<<leaving unnumber (exception: expected string or bytes-like object)

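You can see that the value passed in is a list structure, so you should replace word with word[0] in your call to re.sub. (I also modified your input string to add some digits to the unprotected words, so that the parse action can be seen doing its work.)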

text = "there was 1once a potato_man with tw3nty cars and d& 76 different99 homes"

def unnumber(word):
    return re.sub(r'\d+', '', word[0])

I get:

['there', 'was', 'once', 'a', 'potato_man', 'with', 'tw3nty', 'cars', 'and', 'd&', '76', 'different', 'homes']
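
The exception in the trace output above can be reproduced without pyparsing. The following is only a minimal sketch: it uses a plain Python list as a stand-in for ParseResults (an assumption made so the snippet runs on the standard library alone), and the helper name strip_digits is hypothetical.

```python
import re

def strip_digits(word):
    # Same substitution as in the question's lambda.
    return re.sub(r'\d+', '', word)

# A plain list standing in for a ParseResults such as
# ParseResults(['there1'], {}) (assumption, not the real class).
tokens = ['there1']

try:
    strip_digits(tokens)           # passing the whole list-like object
    error = None
except TypeError as exc:
    error = exc                    # re.sub only accepts str or bytes

cleaned = strip_digits(tokens[0])  # indexing yields the matched string
```

Passing the list raises a TypeError, while passing its first element gives the string with the digits removed.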

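Also, you used the '^' operator for your parser. You may get slightly better performance if you use the '|' operator instead. '^' (which creates an Or instance) evaluates all the alternatives and chooses the longest match, which is necessary when there is ambiguity about what the alternatives might match. '|' creates a MatchFirst instance, which stops as soon as it finds a match and does not look at any further alternatives. Since your first alternative is the list of protected words, '|' is actually more appropriate: if one of them matches, look no further.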
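
Putting the pieces together, here is a sketch of how the corrected parser might look with '|'. The names protected, other, longest and first are mine, chosen for illustration. The code keeps the camelCase method names used in the question (newer pyparsing releases also offer snake_case equivalents such as one_of, set_parse_action and parse_string).

```python
import re
from pyparsing import OneOrMore, Word, alphanums, alphas, oneOf

# Or ('^') tries every alternative and keeps the longest match;
# MatchFirst ('|') stops at the first alternative that matches.
longest = (Word(alphas) ^ Word(alphanums)).parseString("abc123").asList()
first = (Word(alphas) | Word(alphanums)).parseString("abc123").asList()

phrases = ["76", "tw3nty", "potato_man", "d&"]
text = "there was 1once a potato_man with tw3nty cars and d& 76 different99 homes"

def unnumber(tokens):
    # tokens is a ParseResults; tokens[0] is the matched string.
    return re.sub(r'\d+', '', tokens[0])

protected = oneOf(phrases)                        # words to leave untouched
other = Word(alphanums).setParseAction(unnumber)  # any other word loses its digits
parser = OneOrMore(protected | other)

result = parser.parseString(text).asList()
```

Here longest is ['abc123'] and first is ['abc'], which shows the difference between the two operators. result is the same list of words shown earlier, because the protected words are tried first and win as soon as they match.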


Tags: regex, parsing, nlp, pyparsing