Python: is this an inefficient way to compare and sort lists of strings?
I have two lists of strings, A and B. For each string in A, I want to compare it against every string in B and select the most similar match. The comparison function I'm using is a custom cosine similarity measure I found on this question. Here is how it works:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0, 1]
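As a sanity check, the same idea in a minimal, self-contained form (toy strings, and the default tokenizer rather than the stemming normalizer above):

```python
# Self-contained sanity check of the pairwise cosine_sim idea
# (toy strings; default tokenizer instead of the stemming normalizer).
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()

def cosine_sim(text1, text2):
    # Note: fit_transform refits the vocabulary and IDF weights on every call,
    # which is what makes an all-pairs loop over this function expensive.
    tfidf = vec.fit_transform([text1, text2])
    return (tfidf @ tfidf.T).toarray()[0, 1]

print(cosine_sim("the cat sat", "the cat ran"))  # shared terms -> nonzero
print(cosine_sim("apples", "oranges"))           # no shared terms -> 0.0
```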
My problem is that if my lists are even moderately long (500-1000 items), execution starts taking five to ten minutes. Here is an example using some dummy text:
import requests
url = 'https://gist.githubusercontent.com/WalkerHarrison/940c005aa23386a69282f373f6160221/raw/6537d999b9e39d62df3784d2d847d4a6b2602876/sample.txt'
sample = requests.get(url).text
A, B = sample[:int(len(sample)/2)], sample[int(len(sample)/2):]
A, B = list(map(''.join, zip(*[iter(A)]*100))), list(map(''.join, zip(*[iter(B)]*100)))
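The zip(*[iter(A)]*100) idiom above chunks a sequence into fixed-size pieces by drawing repeatedly from a single shared iterator; a small sketch with a made-up string:

```python
def chunks(s, n):
    # zip(*[iter(s)] * n) yields n-tuples drawn from one shared iterator,
    # silently dropping a trailing chunk shorter than n
    return list(map(''.join, zip(*[iter(s)] * n)))

print(chunks("abcdefgh", 3))  # ['abc', 'def'] -- the trailing 'gh' is dropped
```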
Now I have two lists of roughly 500 strings each (100 characters per string), and I compute the similarities and take the top match. This is done by taking a string from A, iterating over B, sorting by cosine_sim score and taking the last element, then repeating for every element of A:
matches = [(a, list(sorted([[b, cosine_sim(a, b)]
                            for b in B], key=lambda x: x[1]))[-1])
           for a in A]
The output is a list of matches where each item contains both strings and their computed similarity score. That last line took 7 minutes to run, though. I'm wondering whether there are inefficiencies in my process, or whether this is simply a lot of computation (500*500 = 250,000 comparisons, plus picking the best match 500 times)?
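For reference, the sort-and-take-last step per query is equivalent to a single-pass max() with a key; a self-contained sketch with a toy similarity function standing in for cosine_sim:

```python
# Single-pass selection of the best match per query: max() with a key
# replaces sorting all of B and taking the last element.
def sim(a, b):
    # Toy stand-in similarity for illustration: Jaccard overlap of characters.
    return len(set(a) & set(b)) / len(set(a) | set(b))

A = ["abc", "xyz"]
B = ["abd", "uvw", "xyy"]

matches = [(a, max(B, key=lambda b: sim(a, b))) for a in A]
print(matches)  # [('abc', 'abd'), ('xyz', 'xyy')]
```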
The biggest issue is probably that you are computing tf-idf for every pair of documents (where "document" just means your unit of text: that could be a tweet, a sentence, a scientific paper, or a book). Also, you shouldn't write your own similarity measure when one already exists. Finally, sklearn has a pairwise_distances routine that does what you want and is optimized. Putting it all together, here is an example script:
import requests
import nltk, string
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import pairwise_distances

url = 'https://gist.githubusercontent.com/WalkerHarrison/940c005aa23386a69282f373f6160221/raw/6537d999b9e39d62df3784d2d847d4a6b2602876/sample.txt'
sample = requests.get(url).text.split('\n\n')  # I'm splitting the document by "paragraphs" since it is unclear what you actually want

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

doc_vectors = vectorizer.fit_transform(sample)       # fit tf-idf once, over all documents
distances = pairwise_distances(doc_vectors, metric='cosine')
np.fill_diagonal(distances, np.inf)                  # exclude each document matching itself
row_idx = list(enumerate(distances.argmin(axis=1)))  # nearest neighbour of each document
sorted_pairs = sorted(row_idx, key=lambda x: distances[x[0], x[1]])

# most similar documents:
i1, i2 = sorted_pairs[0]  # indices of the most similar pair
print(sample[i1])
print("==" * 50)
print(sample[i2])
My sample list contains 99 documents, and this runs almost instantly once the download has finished. Also, the output:
Art party taxidermy locavore 3 wolf moon occupy. Tote bag twee tacos
listicle, butcher single-origin coffee raclette gentrify raw denim
helvetica kale chips shaman williamsburg man braid. Poke normcore lomo
health goth waistcoat kogi. Af next level banh mi, deep v locavore
asymmetrical snackwave chillwave. Subway tile viral flexitarian pok
pok vegan, cardigan health goth venmo artisan. Iceland next level twee
adaptogen, dreamcatcher paleo lyft. Selfies shoreditch microdosing
vape, knausgaard hot chicken pitchfork typewriter polaroid lyft
skateboard ethical distillery. Farm-to-table blue bottle yr artisan
wolf try-hard vegan paleo knausgaard deep v salvia ugh offal
snackwave. Succulents taxidermy cornhole wayfarers butcher, street art
polaroid jean shorts williamsburg la croix tumblr raw denim. Hot
chicken health goth taiyaki truffaut pop-up iceland shoreditch
fingerstache.
====================================================================================================
Organic microdosing keytar thundercats chambray, cray raclette. Seitan
irony raclette chia, cornhole YOLO stumptown. Gluten-free palo santo
beard chia. Whatever bushwick stumptown seitan cred quinoa. Small
batch selfies portland, cardigan you probably haven't heard of them
shabby chic yr four dollar toast flexitarian palo santo beard offal
migas. Kinfolk pour-over glossier, hammock poutine pinterest coloring
book kitsch adaptogen wayfarers +1 tattooed lomo yuccie vice. Plaid
fixie portland, letterpress knausgaard sartorial live-edge. Austin
adaptogen YOLO cloud bread wayfarers cliche hammock banjo. Sustainable
organic air plant mustache.
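Since the original question matches list A against list B rather than all pairs within one list, the same one-shot approach works with two matrices: fit the vectorizer once on the combined texts, transform each list separately, and pairwise_distances(va, vb) returns the full len(A) x len(B) cosine-distance matrix. A sketch with made-up strings (default tokenizer, no stemming):

```python
# Sketch for the two-list setup: one shared vocabulary/IDF, one distance matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import pairwise_distances

A = ["the cat sat on the mat", "dogs chase cars"]
B = ["a cat on a mat", "cars are chased by dogs", "quantum flapdoodle"]

vectorizer = TfidfVectorizer().fit(A + B)            # fit once on all texts
va, vb = vectorizer.transform(A), vectorizer.transform(B)

dist = pairwise_distances(va, vb, metric='cosine')   # shape (len(A), len(B))
best = dist.argmin(axis=1)                           # best index into B per row of A
matches = [(a, B[j]) for a, j in zip(A, best)]
print(matches)
```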