Need help to remove punctuation and replace numbers for an nlp task

For example, I have a list of strings:

sentence = ['cracked $300 million','she\'s resolutely, smitten ', 'that\'s creative [r]', 'the market ( knowledge check : prices up!']

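I want to remove the punctuation and replace numbers with the '£' symbol. I tried the code below, but when I run both substitutions, only one or the other takes effect. My code is as follows: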

import re
s =([re.sub(r'[!":$()[]\',]',' ', word) for word in sentence]) 

s= [([re.sub(r'\d+','£', word) for word in s])]
print(s)

I think the problem might be with the square brackets? Thanks!

Using your input and pattern:

>>> ([re.sub(r'[!":$()[]\',]',' ', word) for word in sentence]) 
['cracked $300 million', "she's resolutely, smitten ", "that's creative [r]", 'the market ( knowledge check : prices up!']
>>> 

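Nothing is replaced because [!":$()[] is parsed as a complete character class (it ends at the first unescaped ]), and the remaining \',] is a literal sequence. In other words, the engine is looking for one of those class characters immediately followed by the text ',]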
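To see this concretely, here is a small check. The second test string is made up purely for illustration; it is not from the question.

```python
import re

# The original pattern: the character class ends at the first unescaped ].
broken = re.compile(r'[!":$()[]\',]')

# Ordinary text contains no class character followed by ',] so nothing changes.
unchanged = broken.sub(' ', "she's resolutely, smitten ")
print(unchanged == "she's resolutely, smitten ")

# A match only happens when a class character is followed by the literal ',]
hit = broken.search("oops!',] here")
print(hit.group())
```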

Escape the closing bracket inside the character class:

\]

>>> ([re.sub(r'[!":$()[\]\',]',' ', word) for word in sentence]) 
['cracked  300 million', 'she s resolutely  smitten ', 'that s creative  r ', 'the market   knowledge check   prices up ']
>>> 

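Edit: If you are trying to stack several operations inside one list comprehension, put the operations in a function and call that function instead: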

def process_word(word):
  word = re.sub(r'[!":$()[\]\',]',' ', word)
  word = re.sub(r'\d+','£', word)
  return word

Result:

>>> [process_word(word) for word in sentence]
['cracked  £ million', 'she s resolutely  smitten ', 'that s creative  r ', 'the market   knowledge check   prices up ']

Sorry, I missed the second part of your request, but you can handle both the digits and the punctuation like this:

import string

sentence = ['cracked $300 million', 'she\'s resolutely, smitten ', 'that\'s creative [r]',
            'the market ( knowledge check : prices up!']

def replaceDigitAndPunctuation(newSentence):
    new_word = ""
    for char in newSentence:
        if char in string.digits:
            # each individual digit becomes its own £
            new_word += "£"
        elif char in string.punctuation:
            # punctuation is dropped, not replaced with a space
            pass
        else:
            new_word += char
    return new_word


for i in range(len(sentence)):
    sentence[i] = replaceDigitAndPunctuation(sentence[i])
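Note that the character-by-character loop emits one £ per digit, so a three-digit number turns into three symbols. If a single £ per number is wanted, one possible variant (a sketch, not the answerer's code; `strip_and_mask` is a hypothetical helper name) deletes ASCII punctuation with `str.translate` and then collapses each run of digits with `re.sub`:

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCT = str.maketrans('', '', string.punctuation)

def strip_and_mask(text):  # hypothetical helper name
    """Delete ASCII punctuation, then replace each run of digits with one £."""
    return re.sub(r'\d+', '£', text.translate(_DELETE_PUNCT))

print(strip_and_mask('cracked $300 million'))  # cracked £ million
```

Unlike the regex answers in this thread, this removes punctuation outright instead of substituting a space, so "that's" becomes "thats".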

If you want to replace certain specific punctuation characters with a space and any run of digits with a £ symbol, you can use

import re
rx = re.compile(r'''[][!":$()',]|(\d+)''')
sentence = ['cracked $300 million','she\'s resolutely, smitten ', 'that\'s creative [r]', 'the market ( knowledge check : prices up!']
s = [rx.sub(lambda x: '£' if x.group(1) else ' ', word) for word in sentence] 
print(s) # => ['cracked  £ million', 'she s resolutely  smitten ', 'that s creative  r ', 'the market   knowledge check   prices up ']

See the Python demo.

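Note the ] and [ inside the character class: when ] is the first character in the class it does not need to be escaped, and [ never needs to be escaped inside a character class. I also used a triple-quoted string literal, so " and ' can be written as they are, without extra escaping.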

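So here, [][!":$()',]|(\d+) matches any one of ], [, !, ", :, $, (, ), ' or a comma, or it matches one or more digits and captures them into Group 1. If Group 1 matched, the replacement is the pound sign; otherwise it is a space.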
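The same callback can be written as a named function, which makes the branch on Group 1 easier to see. This is only an illustrative sketch: the function name `replacement` and the test string are assumptions, not part of the answer above.

```python
import re

rx = re.compile(r'''[][!":$()',]|(\d+)''')

def replacement(match):  # hypothetical name for the callback
    # group(1) is None when the punctuation branch matched,
    # and holds the digit run when the \d+ branch matched.
    return '£' if match.group(1) is not None else ' '

masked = rx.sub(replacement, 'call [5] or (42)!')
print(repr(masked))
```

Because the alternation is tried at every position in a single pass, punctuation and digits are handled together and there is no need to chain two `re.sub` calls.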