Need help removing punctuation and replacing numbers for an NLP task
For example, I have a list of strings:
sentence = ['cracked $300 million','she\'s resolutely, smitten ', 'that\'s creative [r]', 'the market ( knowledge check : prices up!']
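I want to remove the punctuation and replace the numbers with a '£' symbol.
I tried this, but when I try to run both substitutions, only one or the other takes effect.
My code is below: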
import re
s =([re.sub(r'[!":$()[]\',]',' ', word) for word in sentence])
s= [([re.sub(r'\d+','£', word) for word in s])]
print(s)
I think the problem may be with the square brackets?
Thanks!
Using your input and pattern:
>>> ([re.sub(r'[!":$()[]\',]',' ', word) for word in sentence])
['cracked $300 million', "she's resolutely, smitten ", "that's creative [r]", 'the market ( knowledge check : prices up!']
>>>
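The reason nothing is replaced is that [!":$()[] is treated as a character class and \',] as a literal sequence, i.e. the engine is looking for one character from the class immediately followed by the text ',].
Escape the closing bracket inside the class with \]: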
>>> ([re.sub(r'[!":$()[\]\',]',' ', word) for word in sentence])
['cracked  300 million', 'she s resolutely  smitten ', 'that s creative  r ', 'the market   knowledge check   prices up ']
>>>
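To make the parsing difference concrete, here is a small sketch comparing the two patterns. The probe string is made up for illustration and is not from the question. It deliberately contains the text !',] so that the unescaped pattern has something to match.

```python
import re

broken = r'[!":$()[]\',]'    # parsed as the class [!":$()[] followed by the literal text ',]
fixed = r'[!":$()[\]\',]'    # one single class; the inner ] is escaped

# Hypothetical probe string, chosen so that it contains "!',]"
probe = "up!',] ok, (yes)"

broken_hits = re.findall(broken, probe)
fixed_hits = re.findall(fixed, probe)

print(broken_hits)  # only the four-character sequence matches
print(fixed_hits)   # each punctuation character matches on its own
```

With the unescaped pattern, the only match is the whole four-character run, which is why none of the strings in the question were changed.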
Edit:
If you are trying to stack multiple operations into one list comprehension, put the operations in a function and call that function:
def process_word(word):
    word = re.sub(r'[!":$()[\]\',]',' ', word)
    word = re.sub(r'\d+','£', word)
    return word
Result:
>>> [process_word(word) for word in sentence]
['cracked  £ million', 'she s resolutely  smitten ', 'that s creative  r ', 'the market   knowledge check   prices up ']
Sorry, I didn't see the second part of your request, but you can handle both the digits and the punctuation with this:
import string

sentence = ['cracked $300 million', 'she\'s resolutely, smitten ', 'that\'s creative [r]',
            'the market ( knowledge check : prices up!']

def replaceDigitAndPunctuation(newSentence):
    new_word = ""
    for char in newSentence:
        if char in string.digits:
            # each individual digit becomes its own "£"
            new_word += "£"
        elif char in string.punctuation:
            # punctuation is dropped entirely
            pass
        else:
            new_word += char
    return new_word

for i in range(len(sentence)):
    sentence[i] = replaceDigitAndPunctuation(sentence[i])
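The same character-by-character idea can probably be written more compactly with str.translate. This is only a sketch of an alternative, not the answerer's code: the names TABLE and replace_digits_and_punctuation are made up here, and the sample strings are adapted from the question.

```python
import string

# Three-argument str.maketrans: each digit maps to "£" (one "£" per digit),
# and every character in string.punctuation is deleted.
TABLE = str.maketrans(string.digits, '£' * len(string.digits), string.punctuation)

def replace_digits_and_punctuation(text):
    # hypothetical helper name, used only for this illustration
    return text.translate(TABLE)

sample = ['cracked $300 million', "she's resolutely, smitten "]
cleaned = [replace_digits_and_punctuation(s) for s in sample]
print(cleaned)
```

As with the loop version, every digit is replaced individually, so 300 turns into three £ signs rather than one.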
If you want to replace some specific punctuation characters with a space and any chunk of digits with a £ sign, you can use
import re
rx = re.compile(r'''[][!":$()',]|(\d+)''')
sentence = ['cracked $300 million','she\'s resolutely, smitten ', 'that\'s creative [r]', 'the market ( knowledge check : prices up!']
s = [rx.sub(lambda x: '£' if x.group(1) else ' ', word) for word in sentence]
print(s) # => ['cracked  £ million', 'she s resolutely  smitten ', 'that s creative  r ', 'the market   knowledge check   prices up ']
See the Python demo.
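Note how the character class begins with []: when ] is the first character of a class it does not need to be escaped, and [ never has to be escaped inside a character class. I also used a triple-quoted string literal, so you can use " and ' as they are, without extra escaping.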
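So here, [][!":$()',]|(\d+) matches ], [, !, ", :, $, (, ), ' or a comma, or it matches one or more digits and captures them into Group 1. If Group 1 matched, the replacement is the pound sign (£); otherwise it is a space.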
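As a quick check of both points, the sketch below shows that a leading ] is literal inside a class and that Group 1 is only set when the digit alternative matched. The probe string is invented for illustration.

```python
import re

rx = re.compile(r'''[][!":$()',]|(\d+)''')

# "]" right after the opening "[" is literal, so [][] is a class holding "]" and "[".
brackets = re.findall(r'[][]', 'a[b]c')

# For each match, record the matched text and the content of group 1.
# group(1) is None whenever the punctuation alternative matched.
probe = 'up 42% [ok]'   # hypothetical input; "%" is not in the class, so it is left alone
hits = [(m.group(0), m.group(1)) for m in rx.finditer(probe)]

print(brackets)
print(hits)
```

Because group(1) is None for punctuation matches, the lambda in the answer can use it directly as the condition for choosing between £ and a space.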