Can you extract indexes of data over a threshold from a numpy array or pandas dataframe?

Question

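I am using the following to compare several strings to each other. It is the fastest method I have been able to devise, but it produces a very large 2D array. I can look through it by eye and find what I want, but ideally I would like to set a threshold and pull out the indexes of every value above that number. To complicate matters, I don't want the indexes where a string is compared to itself, and the same string may be duplicated elsewhere in the list, which I would want to know about, so I can't simply ignore the 1s.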

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = sql.get_corpus()

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
similarity = cosine_similarity(vectors)

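sql.get_corpus() returns a list of strings, currently about 1,600 of them.

Is what I want possible? I have tried comparing each of the 1.4M combinations to each other using Levenshtein, which works, but it takes 2.5 hours, far longer than the approach above. I have also tried vectors with spacy, which takes days.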

Answer 1

I'm not sure I read your post correctly, but I believe this should get you started:

import numpy as np

# randomly distributed data we want to filter
data = np.random.rand(5, 5)

# get index of all values above a threshold
threshold = 0.5
above_threshold = data > threshold

# I am assuming your matrix has all string comparisons to
# itself on the diagonal
not_ident = np.identity(5) == 0.

# [edit: to prevent duplicate comparisons, use this instead of not_ident;
# it is kept boolean so that data[result] below still works as a mask]
#upper_only = np.triu(np.ones((5, 5), dtype=bool), k=1)

# 2D array, True when criteria met
result = above_threshold * not_ident
print(result)

# original shape, but 0 in place of all values not matching above criteria
values_orig_shape = data * result
print(values_orig_shape)

# all values that meet criteria, as a 1D array
values = data[result]
print(values)

# indices of all values that meet criteria (in same order as values array)
indices = [index for index,value in np.ndenumerate(result) if value]
print(indices)
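
To tie this back to the question, here is a minimal sketch of the same idea applied to a square similarity matrix such as the one returned by cosine_similarity. The helper name pairs_above_threshold and the toy matrix are made up for illustration; they are not from the question. It assumes the matrix is dense and symmetric, so only the upper triangle is inspected, which skips both the self-comparisons on the diagonal and the mirrored duplicates.

```python
import numpy as np

def pairs_above_threshold(similarity, threshold):
    """Return (i, j, score) for every pair i < j whose score exceeds threshold.

    Hypothetical helper: assumes a dense, square, symmetric matrix.
    """
    sim = np.asarray(similarity)
    # k=1 starts one step above the diagonal, so self-comparisons and the
    # mirrored lower triangle are never visited
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    scores = sim[rows, cols]
    keep = scores > threshold
    return [
        (int(i), int(j), float(s))
        for i, j, s in zip(rows[keep], cols[keep], scores[keep])
    ]

# toy symmetric matrix standing in for cosine_similarity(vectors);
# rows 1 and 2 play the role of two identical strings (score 1.0)
toy = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 1.0],
    [0.1, 1.0, 1.0],
])

pairs = pairs_above_threshold(toy, 0.8)
print(pairs)
```

Because the diagonal is excluded by position rather than by value, a score of 1.0 off the diagonal still shows up, which is how a duplicated string elsewhere in the list would be reported. With real TF-IDF vectors, identical strings may score very slightly below 1.0 because of floating-point rounding, so a threshold a little under 1.0 is probably safer than an exact equality check.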

python

numpy

levenshtein-distance

scikit-learn

spacy