Errors with indexing when using numpy delete and enumerate

Question

Python 3.9

I have a numpy ndarray of strings. The actual array has thousands of strings, but for this example assume:

words_master = ['CARES' 'BARES' 'CANES' 'TARES' 'PARES' 'BANES' 'BALES' 'CORES' 'BORES'
 'MARES']

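I am trying to write a function that returns a list from which every string containing a given character has been removed. This works as a while loop with an if statement: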

index = 0
temp = []
while index != len(words_master):
    idx = words_master[index]
    if 'A' in idx:
        temp.append(index)
    index += 1
words_master = np.delete(words_master, temp)

Since this is still an explicit loop with an if statement, I was wondering whether a list comprehension could make it more efficient.

My best guess at this is:

words_master = np.delete(words_master, np.argwhere([x for x, item in enumerate(words_master) if 'A' in item]))

The logic here is that np.delete takes the initial array and removes every item at the indices given by np.argwhere. However, it produces this output:

['CARES' 'BORES' 'MARES']

It seems to ignore the first and last elements?

Other oddities: if I use 'CARES' in item, it returns the list unchanged:

['CARES' 'BARES' 'CANES' 'TARES' 'PARES' 'BANES' 'BALES' 'CORES' 'BORES'
 'MARES']

If I use any other argument ('MARES' or 'M' or 'O'), it seems to return the full list without the first word:

['BARES' 'CANES' 'TARES' 'PARES' 'BANES' 'BALES' 'CORES' 'BORES' 'MARES']

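I have tried:

Adjusting the indices, for example with (reversed(list(enumerate.. or by offsetting the index list by -1. These produce the same kind of pattern, only shifted.
Using np.where() instead, but I ran into similar problems.

Is there a clean way to fix this? Or is the while loop/if statement the best option?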

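Edit: regarding the "why not use a list" question, I read that numpy arrays are much faster than Python lists. When I tested the same loop with a Python list and the remove() function, it was 10 times slower on a larger dataset.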

Answer 1

import numpy as np

words_master = np.array(['CARES', 'BARES', 'CANES', 'TARES', 'PARES', 'BANES', 'BALES', 'CORES', 'BORES', 'MARES'])

Yes. This can be written more clearly as a list comprehension used for boolean indexing.

bad_char = "A"
words_without_char = words_master[[bad_char not in x for x in words_master]]

>>> words_without_char
array(['CORES', 'BORES'], dtype='<U5')

You can also build a list directly:

>>> [x for x in words_master if bad_char not in x]
['CORES', 'BORES']
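
As a minimal sketch, the boolean-indexing approach above could be wrapped in a small function. The name remove_words_containing is hypothetical and not part of numpy or the question; the explicit dtype=bool is an added precaution so that an empty input still yields a boolean mask.

```python
import numpy as np

def remove_words_containing(words, bad_char):
    """Return the elements of `words` that do not contain `bad_char`."""
    # One True/False flag per word; True means "keep this word".
    keep = np.array([bad_char not in w for w in words], dtype=bool)
    return words[keep]

words_master = np.array(['CARES', 'BARES', 'CANES', 'TARES', 'PARES',
                         'BANES', 'BALES', 'CORES', 'BORES', 'MARES'])
result = remove_words_containing(words_master, 'A')
print(result)
```

Because the mask marks the words to keep rather than the indices to drop, no call to np.delete is needed.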

Answer 2

Have you tried string methods?

filtered_words_master = [x for x in words_master if x.find('A') == -1]
Something like this?

Edit, attempting to address the array vs. list question:

def filtering_arrays(arr, substring):
  """ Remove elements containing specific substring """
  # str.find returns -1 when the substring is absent, so != -1 means "contains it"
  return np.delete(arr, [i for i, item in enumerate(arr) if item.find(substring) != -1])
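
A short illustration of why the comparison value matters: str.find returns the position of the first match, not a boolean, and it returns -1 when there is no match. The word 'ABOUT' below is a made-up example, not from the question, chosen because its 'A' sits at index 0 rather than index 1.

```python
# str.find gives the index of the first match, or -1 if there is none.
print('CARES'.find('A'))   # match at index 1
print('ABOUT'.find('A'))   # match at index 0
print('CORES'.find('A'))   # no match

words = ['CARES', 'ABOUT', 'CORES']
# Keep only the words in which 'A' does not occur at all.
kept = [w for w in words if w.find('A') == -1]
print(kept)
```

Comparing against 1 only happens to work for the sample words because every 'A' in them is the second letter.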

Answer 3

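argwhere returns the positions at which its argument is nonzero. Here its argument is the list of indices produced by the enumerate comprehension, so you get positions within that list instead of the indices themselves. That is not what you want.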

In [241]: [x for x, item in enumerate(words_master) if 'A' in item]
Out[241]: [0, 1, 2, 3, 4, 5, 6, 9]
In [242]: np.argwhere(_)
Out[242]: 
array([[1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7]])
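
The three symptoms described in the question follow from this. A small sketch that reproduces them (buggy_delete is a hypothetical name used only for this illustration):

```python
import numpy as np

words_master = np.array(['CARES', 'BARES', 'CANES', 'TARES', 'PARES',
                         'BANES', 'BALES', 'CORES', 'BORES', 'MARES'])

def buggy_delete(words, needle):
    # Indices of the words that contain `needle`.
    hits = [i for i, item in enumerate(words) if needle in item]
    # argwhere reports where `hits` is nonzero, i.e. positions inside `hits`.
    # An index of 0 counts as "zero" and is dropped; the rest are renumbered.
    return np.delete(words, np.argwhere(hits))

print(buggy_delete(words_master, 'A'))      # hits has 8 entries, positions 1..7 are deleted
print(buggy_delete(words_master, 'CARES'))  # hits == [0], nothing is nonzero, nothing deleted
print(buggy_delete(words_master, 'MARES'))  # hits == [9], position 0 is deleted
```

In each case np.delete receives positions within the hit list, so which words disappear depends on how many hits there are and whether index 0 is among them.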

Without argwhere, the enumerate version works fine:

In [247]: np.delete(words_master, [x for x, item in enumerate(words_master) if
     ...: 'A' in item])
Out[247]: array(['CORES', 'BORES'], dtype='<U5')

But compare its time against a plain list comprehension:

In [248]: timeit np.delete(words_master, [x for x, item in enumerate(words_master) if 'A' in item])
27.8 µs ± 930 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [249]: timeit [word for word in words_master if word.find('A')==-1]
1.73 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [251]: timeit [word for word in words_master if 'A' not in word]
604 ns ± 2.86 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

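The enumerate part of the delete timing costs about the same as the other comprehensions, so most of the time in [248] is spent in delete itself. Although it is an array function, it is not especially fast. It may scale better than the comprehensions, but we still have not gotten rid of those.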

In [252]: timeit [x for x, item in enumerate(words_master) if 'A' in item]
1.06 µs ± 4.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

If we start with an array of strings (rather than a list), indexing it directly is faster than going through delete:

In [279]: arr = np.array(words_master)
In [280]: arr[['A' not in word for word in arr]]
Out[280]: array(['CORES', 'BORES'], dtype='<U5')
In [281]: timeit arr[['A' not in word for word in arr]]
12.9 µs ± 480 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But we can improve on that by using both the array (for indexing) and the list (for iteration):

In [282]: timeit arr[['A' not in word for word in words_master]]
6.27 µs ± 245 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
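
The timings above are for ten words. To check how the approaches compare on something closer to the "thousands of strings" in the question, a rough benchmark sketch with the standard timeit module might look like this. The 10,000-word input is an assumption, built by repeating the sample list, and the function names are made up for the illustration. Absolute numbers will differ by machine and numpy version.

```python
import timeit
import numpy as np

base = ['CARES', 'BARES', 'CANES', 'TARES', 'PARES',
        'BANES', 'BALES', 'CORES', 'BORES', 'MARES']
words_list = base * 1000          # assumed stand-in for the real data
arr = np.array(words_list)

def with_delete():
    # Collect the indices to drop, then let np.delete remove them.
    return np.delete(arr, [i for i, w in enumerate(words_list) if 'A' in w])

def with_mask():
    # Iterate over the list, index the array with the boolean mask.
    return arr[['A' not in w for w in words_list]]

def with_list():
    # Pure Python: no numpy involved.
    return [w for w in words_list if 'A' not in w]

runs = 20
for fn in (with_delete, with_mask, with_list):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e6:.1f} us per call")
```

All three functions return the same words; only the container type (array vs. list) and the running time differ.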

Answer 4

You asked about numpy, so here is a one-line numpy solution:

import numpy as np
words_master = np.array(['CARES','BARES','CANES','TARES','PARES','BANES','BALES','CORES','BORES','MARES'])

words_without_char=words_master[np.char.find(words_master,"A")==-1]

If find does not find the character, it returns -1, so only those items are kept.
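
To make the one-liner easier to follow, here is the same expression split into steps. The intermediate names positions and mask are introduced only for this sketch.

```python
import numpy as np

words_master = np.array(['CARES', 'BARES', 'CANES', 'TARES', 'PARES',
                         'BANES', 'BALES', 'CORES', 'BORES', 'MARES'])

# First index of 'A' in each word, or -1 where 'A' does not occur.
positions = np.char.find(words_master, 'A')
# True for the words that have no 'A'.
mask = positions == -1
print(positions)
print(mask)
print(words_master[mask])
```

Unlike the list-comprehension versions, the search here is applied to the whole array by np.char.find, with no Python-level loop in the calling code.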

python

numpy

list-comprehension

numpy-ndarray