How to remove similar keywords from a list in Python?
I am trying to remove similar keywords from a list of keywords:
keywords=['ps4 pro deals','ps4 pro deals London']
So I just need to remove the similar "ps4 pro deals". The code I tried, using Levenshtein distance for the similarity check:
similar_tags = []
to_be_removed = []
for word1 in keywords:
    for word2 in keywords:
        if .5 < Levenshtein.token_sort_ratio(word1, word2) < 1:
            to_be_removed.append(word1)
for word in to_be_removed:
    if word in keywords:
        keywords.remove(word)
This code removes both keywords instead of just the similar one.
Consider the following simple example:
words = ['A', 'B']
for w1 in words:
    for w2 in words:
        print(w1, w2)
Output:
A A
A B
B A
B B
Note that there is both A B and B A. If A B satisfies the condition, then B A does too (the Levenshtein distance does not depend on the order of its inputs): the first causes A to be added to the removal list, the second causes B to be added, so both A and B get removed.
You can use the following construct instead, in which w2 always comes after w1 in words:
words = ['A', 'B', 'C']
for inx, w1 in enumerate(words, 1):
    for w2 in words[inx:]:
        print(w1, w2)
Output:
A B
A C
B C
Explanation: for each w1 in words, I take a slice of words that contains only the elements beyond it. I use enumerate to get the number of elements that need to be skipped, and then slice to skip them.
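Applied to the original keyword problem, a minimal sketch could look like the following. It assumes the rapidfuzz library (pip install rapidfuzz), whose fuzz.token_sort_ratio returns a score in 0-100 rather than 0-1, so the threshold is scaled accordingly; keeping the longer keyword of a similar pair is my assumption about the desired behaviour:

from rapidfuzz import fuzz

keywords = ['ps4 pro deals', 'ps4 pro deals London']

to_be_removed = set()
for inx, word1 in enumerate(keywords, 1):
    for word2 in keywords[inx:]:       # w2 always comes after w1
        score = fuzz.token_sort_ratio(word1, word2)
        if 50 < score < 100:           # similar but not identical
            # Assumption: keep the longer keyword, drop the shorter one.
            to_be_removed.add(min(word1, word2, key=len))

keywords = [w for w in keywords if w not in to_be_removed]
print(keywords)  # ['ps4 pro deals London']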
similar_tags = filter(lambda x: [x for i in keywords if x in i and x < i], keywords)
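This one-liner collects every keyword x that also occurs as a substring of another keyword i (the x < i comparison prevents a keyword from matching itself), i.e. exactly the tags to drop. A short demonstration (note that in Python 3 filter returns an iterator, hence the list() call):

keywords = ['ps4 pro deals', 'ps4 pro deals London']
similar_tags = list(filter(lambda x: [i for i in keywords if x in i and x < i], keywords))
print(similar_tags)  # ['ps4 pro deals']
keywords = [k for k in keywords if k not in similar_tags]
print(keywords)      # ['ps4 pro deals London']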
From https://gist.github.com/alvations/a4a6e0cc24d2fd9aff86, there is an implementation of the "approximate dictionary matching" algorithm described in http://www.aclweb.org/anthology/C10-1096:
Input: V : collection of strings
Input: x: query string
Input: α: threshold for the similarity
Output: Y: list of strings similar to the query
1. X ← string_to_feature(x);
2. Y ← [];
3. for l ← min_y(|X|, α) to max_y(|X|, α) do
4.     τ ← min_overlap(|X|, l, α);
5.     R ← overlapjoin(X, τ, V, l);
6.     foreach r ∈ R do append r to Y;
7. end
8. return Y;
And the code:
from math import sqrt
from collections import defaultdict

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def get_V(y, n):
    """
    Extract an ngram feature matrix for string comparison for a specified order n.
    To extract the vectorized feature matrix, we could go the DIY way of
    (i) extracting ngrams, (ii) then creating a numpy array for the matrix,
    (iii) then fitting new queries to the matrix. But since scikit-learn
    already has those functions optimized for speed, we should use them.
    """
    ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(n-1, n), min_df=1)
    V = ngram_vectorizer.fit_transform(y)
    return V, ngram_vectorizer

def get_V_by_size(V):
    """
    Get an "index" over the matrix V, where the key is a size and the values
    are the indices of the rows of V whose size corresponds to that key.
    This is an optimization trick. For this to scale, we might have to query
    a proper dataframe/database if our V is very, very large; otherwise, if
    it fits in RAM, we can simply prune the sequential queries by restricting
    the size.
    """
    V_by_size = defaultdict(list)
    for i, y_vec in enumerate(V):
        size_y = np.sum(y_vec.toarray())
        V_by_size[size_y].append(i)
    return V_by_size

def min_max_y(size_X, alpha):
    """
    Computes the size thresholds for Y given the size of X and alpha.
    """
    return int(alpha * alpha * size_X), int(size_X / (alpha * alpha))

def overlapjoin(vec_x, tau, sub_V, l):
    """
    Find the no. of overlapping ngrams between the query *vec_x* and the
    candidate subset of V, yielding the indices whose overlap exceeds *tau*.
    """
    for i, _y in sub_V:
        # Possibly this can be optimized by checking through only the
        # non-zeros in the sparse matrix, but there might be some caveats
        # when synchronizing the non-zeros of *vec_x* and *_y*.
        num_overlaps = sum(1 for x_fi, y_fi in zip(vec_x, _y)
                           if x_fi & y_fi > 0 and x_fi == y_fi)
        if num_overlaps > tau:
            yield i

def approx_dict_matching(x, y, V=None, vectorizer=None, V_by_size=None,
                         n=3, alpha=0.7):
    """
    The "approximate dictionary matching" algorithm as described in
    http://www.aclweb.org/anthology/C10-1096

    :param x: The query word.
    :type x: str
    :param y: A list of words in the vocabulary.
    :type y: list(str)
    :param n: The order of the ngrams, e.g. n=3 uses trigrams, n=2 uses bigrams.
    :type n: int
    :param alpha: A parameter to specify the length threshold of similar words.
    :type alpha: float
    """
    # Allows V to be instantiated just once, outside of repeated calls to
    # approx_dict_matching().
    if V is None and vectorizer is None and V_by_size is None:
        V, vectorizer = get_V(y, n)
        # Optimization trick:
        # Pre-compute a dictionary indexing the rows of V by their size.
        V_by_size = get_V_by_size(V)
    # Note that sklearn expects a list of queries, thus the [x].
    # Step 1. X ← string_to_feature(x);
    vec_x = vectorizer.transform([x]).toarray()[0]
    # print(V.todense())  # Show feature matrix V in human-readable format.
    # print(vectorizer.transform([x]).toarray()[0])  # Show vector for query string x.
    # Find the size of X.
    size_X = sum(vec_x)
    # Get the min and max size of y, given the alpha tolerance parameter.
    min_y, max_y = min_max_y(size_X, alpha)
    # Step 2. Y ← [];
    output = set()
    # Step 3. for l ← min_y(|X|, α) to max_y(|X|, α) do
    for l in range(min_y, max_y):
        # Step 4. τ ← min_overlap(|X|, l, α);
        tau = alpha * sqrt(size_X * l)
        # The subset of V where the words are of size l.
        sub_V_indices = V_by_size.get(l)
        if sub_V_indices:
            sub_V = [(i, V[i].toarray()[0]) for i in sub_V_indices]
            # Step 5. R ← overlapjoin(X, τ, V, l);
            R = list(overlapjoin(vec_x, tau, sub_V, l))
            # Step 6. foreach r ∈ R do append r to Y;
            output.update(R)
    return set(y[i] for i in output)
And to use it, possibly something like this:
kw1 = 'ps4 pro deals'
kw2_list = 'ps4 pro deals London'.split()
print(approx_dict_matching(kw1, kw2_list, n=3, alpha=0.1))
[out]:
{'ps4', 'deals', 'pro'}
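To tie this back to the original question, one possible wrapper is sketched below. It is purely illustrative: the dedupe helper and its keep-the-longer rule are my own additions, not part of the gist, and whether a given pair actually matches depends heavily on the choice of n and alpha:

def dedupe(keywords, n=3, alpha=0.7):
    # Hypothetical helper: drop a keyword whenever approx_dict_matching
    # finds a longer (or equally long) keyword similar to it.
    kept = []
    for kw in keywords:
        others = [k for k in keywords if k != kw]
        matches = approx_dict_matching(kw, others, n=n, alpha=alpha)
        if not any(len(m) >= len(kw) for m in matches):
            kept.append(kw)
    return kept

keywords = ['ps4 pro deals', 'ps4 pro deals London']
print(dedupe(keywords))  # ideally ['ps4 pro deals London'], depending on n and alpha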