How to remove similar keywords from a list in Python?
I am trying to remove similar keywords from a list of keywords:
keywords=['ps4 pro deals','ps4 pro deals London']
So I just need to remove the similar "ps4 pro deals". The code I tried, using Levenshtein distance for the similarity check:
similar_tags = []
to_be_removed = []
for word1 in keywords:
    for word2 in keywords:
        if .5 < Levenshtein.token_sort_ratio(word1, word2) < 1:
            to_be_removed.append(word1)
for word in to_be_removed:
    if word in keywords:
        keywords.remove(word)
This code removes both keywords instead of just the similar one.
Consider the following simple example:
words = ['A', 'B']
for w1 in words:
    for w2 in words:
        print(w1, w2)
Output:
A A
A B
B A
B B
Note that there is both A B and B A. If A B satisfies the condition, then B A does too (the Levenshtein distance does not depend on the order of its inputs): the first causes A to be added to the removal list, the second causes B to be added, so both A and B get removed.
You can use the following construct instead, in which w2 always comes after w1 in words:
words = ['A', 'B', 'C']
for inx, w1 in enumerate(words, 1):
    for w2 in words[inx:]:
        print(w1, w2)
Output:
A B
A C
B C
Explanation: for each w1 in words, I take a slice of words that contains only the elements beyond it. I use enumerate to get the number of elements that need to be skipped, and then slice to skip them.
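Applied to the original keyword problem, a minimal sketch could look like the following. It assumes the rapidfuzz library (pip install rapidfuzz), whose fuzz.token_sort_ratio returns a score in 0-100 rather than 0-1, so the threshold is scaled accordingly; keeping the longer keyword of a similar pair is my assumption about the desired behaviour:

from rapidfuzz import fuzz

keywords = ['ps4 pro deals', 'ps4 pro deals London']

to_be_removed = set()
for inx, word1 in enumerate(keywords, 1):
    for word2 in keywords[inx:]:       # w2 always comes after w1
        score = fuzz.token_sort_ratio(word1, word2)
        if 50 < score < 100:           # similar but not identical
            # Assumption: keep the longer keyword, drop the shorter one.
            to_be_removed.add(min(word1, word2, key=len))

keywords = [w for w in keywords if w not in to_be_removed]
print(keywords)  # ['ps4 pro deals London']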
similar_tags = filter(lambda x: [x for i in keywords if x in i and x < i], keywords)
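This one-liner collects every keyword x that also occurs as a substring of another keyword i (the x < i comparison prevents a keyword from matching itself), i.e. exactly the tags to drop. A short demonstration (note that in Python 3 filter returns an iterator, hence the list() call):

keywords = ['ps4 pro deals', 'ps4 pro deals London']
similar_tags = list(filter(lambda x: [i for i in keywords if x in i and x < i], keywords))
print(similar_tags)  # ['ps4 pro deals']
keywords = [k for k in keywords if k not in similar_tags]
print(keywords)      # ['ps4 pro deals London']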
From https://gist.github.com/alvations/a4a6e0cc24d2fd9aff86, there is an implementation of the "approximate dictionary matching" algorithm described in http://www.aclweb.org/anthology/C10-1096:
Input: V : collection of strings
Input: x: query string
Input: α: threshold for the similarity
Output: Y: list of strings similar to the query
1. X ← string_to_feature(x);
2. Y ← [];
3. for l ← min_y(|X|, α) to max_y(|X|, α) do
4.     τ ← min_overlap(|X|, l, α);
5.     R ← overlapjoin(X, τ, V, l);
6.     foreach r ∈ R do append r to Y;
7. end
8. return Y;
And the code:
from math import sqrt
from collections import defaultdict

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def get_V(y, n):
    """
    Extract an ngram feature matrix for string comparison for a specified order n.
    To extract the vectorized feature matrix, we could go the DIY way of
    (i) extracting ngrams, (ii) then creating a numpy array for the matrix,
    (iii) then fitting new queries to the matrix. But since scikit-learn
    already has those functions optimized for speed, we should use them.
    """
    ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(n-1, n), min_df=1)
    V = ngram_vectorizer.fit_transform(y)
    return V, ngram_vectorizer

def get_V_by_size(V):
    """
    Get an "index" over the matrix V, where the key is a size and the values
    are the indices of the rows of V whose size corresponds to that key.
    This is an optimization trick. For this to scale, we might have to query
    a proper dataframe/database if our V is very, very large; otherwise, if
    it fits in RAM, we can simply prune the sequential queries by restricting
    the size.
    """
    V_by_size = defaultdict(list)
    for i, y_vec in enumerate(V):
        size_y = np.sum(y_vec.toarray())
        V_by_size[size_y].append(i)
    return V_by_size

def min_max_y(size_X, alpha):
    """
    Computes the size thresholds for Y given the size of X and alpha.
    """
    return int(alpha * alpha * size_X), int(size_X / (alpha * alpha))

def overlapjoin(vec_x, tau, sub_V, l):
    """
    Find the no. of overlapping ngrams between the query *vec_x* and the
    candidate subset of V, yielding the indices whose overlap exceeds *tau*.
    """
    for i, _y in sub_V:
        # Possibly this can be optimized by checking through only the
        # non-zeros in the sparse matrix, but there might be some caveats
        # when synchronizing the non-zeros of *vec_x* and *_y*.
        num_overlaps = sum(1 for x_fi, y_fi in zip(vec_x, _y)
                           if x_fi & y_fi > 0 and x_fi == y_fi)
        if num_overlaps > tau:
            yield i

def approx_dict_matching(x, y, V=None, vectorizer=None, V_by_size=None,
                         n=3, alpha=0.7):
    """
    The "approximate dictionary matching" algorithm as described in
    http://www.aclweb.org/anthology/C10-1096

    :param x: The query word.
    :type x: str
    :param y: A list of words in the vocabulary.
    :type y: list(str)
    :param n: The order of the ngrams, e.g. n=3 uses trigrams, n=2 uses bigrams.
    :type n: int
    :param alpha: A parameter to specify the length threshold of similar words.
    :type alpha: float
    """
    # Allows V to be instantiated just once, outside of repeated calls to
    # approx_dict_matching().
    if V is None and vectorizer is None and V_by_size is None:
        V, vectorizer = get_V(y, n)
        # Optimization trick:
        # Pre-compute a dictionary indexing the rows of V by their size.
        V_by_size = get_V_by_size(V)
    # Note that sklearn expects a list of queries, thus the [x].
    # Step 1. X ← string_to_feature(x);
    vec_x = vectorizer.transform([x]).toarray()[0]
    # print(V.todense())  # Show feature matrix V in human-readable format.
    # print(vectorizer.transform([x]).toarray()[0])  # Show vector for query string x.
    # Find the size of X.
    size_X = sum(vec_x)
    # Get the min and max size of y, given the alpha tolerance parameter.
    min_y, max_y = min_max_y(size_X, alpha)
    # Step 2. Y ← [];
    output = set()
    # Step 3. for l ← min_y(|X|, α) to max_y(|X|, α) do
    for l in range(min_y, max_y):
        # Step 4. τ ← min_overlap(|X|, l, α);
        tau = alpha * sqrt(size_X * l)
        # The subset of V where the words are of size l.
        sub_V_indices = V_by_size.get(l)
        if sub_V_indices:
            sub_V = [(i, V[i].toarray()[0]) for i in sub_V_indices]
            # Step 5. R ← overlapjoin(X, τ, V, l);
            R = list(overlapjoin(vec_x, tau, sub_V, l))
            # Step 6. foreach r ∈ R do append r to Y;
            output.update(R)
    return set(y[i] for i in output)
And to use it, possibly something like this:
kw1 = 'ps4 pro deals'
kw2_list = 'ps4 pro deals London'.split()
print(approx_dict_matching(kw1, kw2_list, n=3, alpha=0.1))
[out]:
{'ps4', 'deals', 'pro'}
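To tie this back to the original question, one possible wrapper is sketched below. It is purely illustrative: the dedupe helper and its keep-the-longer rule are my own additions, not part of the gist, and whether a given pair actually matches depends heavily on the choice of n and alpha:

def dedupe(keywords, n=3, alpha=0.7):
    # Hypothetical helper: drop a keyword whenever approx_dict_matching
    # finds a longer (or equally long) keyword similar to it.
    kept = []
    for kw in keywords:
        others = [k for k in keywords if k != kw]
        matches = approx_dict_matching(kw, others, n=n, alpha=alpha)
        if not any(len(m) >= len(kw) for m in matches):
            kept.append(kw)
    return kept

keywords = ['ps4 pro deals', 'ps4 pro deals London']
print(dedupe(keywords))  # ideally ['ps4 pro deals London'], depending on n and alpha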