Can I use regular expressions re.sub() with a numpy array or list of strings?
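I have a numpy array of entries with dtype=string_. I'd like to use the regular expression re module to replace all superfluous whitespace, \t tabs, and \n newlines.
If I were working with a single string, I would use re.sub() as follows: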
import re
proust = 'If a little dreaming is dangerous, \t the cure for it is not to dream less but to dream more. \t\t'
newstring = re.sub(r"\s+", " ", proust)
which returns
'If a little dreaming is dangerous, the cure for it is not to dream less but to dream more. '
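To do this for every entry of a numpy array, I presumably need some kind of for loop.
Something like for i in numpy_arr:, but I'm not sure what should follow it in order to apply re.sub() to each element of the array.
What is the most sensible way to approach this problem?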
EDIT:
My original numpy array or list is a LONG list/array of entries, each entry being one sentence like the one above. Here is an example with five entries:
original_list = [ 'to be or \n\n not to be that is the question',
' to be or not to be that is the question\t ',
'to be or not to be that is the question',
'to be or not to be that is the question\t ',
'to be or not to be that is \t the question']
This is not exactly your re.sub, but the effect is the same, if not better:
In [109]: oarray
Out[109]:
array(['to be or \n\n not to be that is the question',
' to be or not to be that is the question\t ',
'to be or not to be that is the question',
'to be or not to be that is the question\t ',
'to be or not to be that is \t the question'],
dtype='<U55')
In [110]: np.char.join(' ', np.char.split(oarray))
Out[110]:
array(['to be or not to be that is the question',
'to be or not to be that is the question',
'to be or not to be that is the question',
'to be or not to be that is the question',
'to be or not to be that is the question'],
dtype='<U39')
It works in this case because split() recognizes the same set of whitespace characters as '\s+'.
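A quick check on a single string may make the comparison concrete. This is only a minimal sketch using the standard library; the sample string s is made up for illustration. Note that the split/join form also drops leading and trailing whitespace, whereas re.sub() leaves a single space at each end.

```python
import re

# Hypothetical sample string with mixed spaces, tabs and newlines.
s = " to be \t or \n\n not to be \t "

via_re = re.sub(r"\s+", " ", s)   # collapses each whitespace run, keeps one space at each end
via_split = " ".join(s.split())   # collapses each whitespace run and trims both ends

print(repr(via_re))
print(repr(via_split))

# Apart from the leading/trailing space, the two results agree.
assert via_re.strip() == via_split
```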
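np.char.replace will replace selected characters, but it would have to be applied several times: once to remove '\n', again to remove '\t', and so on. There is also a translate.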
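If the regular expression itself is needed (for instance, for a pattern that split() cannot express), one option is a plain list comprehension over the array with a precompiled pattern. The following is a sketch under assumptions, not the method from the answer above: the names WS and collapse_ws are hypothetical, the input is assumed to be a one-dimensional array of str, and the trailing .strip() is an added choice so that the result matches the split/join output. For a bytes array (dtype=string_) a bytes pattern such as rb"\s+" would be needed instead.

```python
import re
import numpy as np

# Hypothetical name; the pattern is compiled once and reused for every entry.
WS = re.compile(r"\s+")

def collapse_ws(arr):
    """Return a new array with each whitespace run replaced by one space and the ends trimmed."""
    return np.array([WS.sub(" ", s).strip() for s in arr])

oarray = np.array(['to be or \n\n not to be that is the question',
                   ' to be or not to be that is the question\t ',
                   'to be or not to be that is \t the question'])

cleaned = collapse_ws(oarray)
print(cleaned)
```

The loop runs in Python rather than in compiled code, so on a very long array it is worth timing this against the np.char approach before settling on one.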