Return the key if each word in a list of words exists in a dictionary that has lists of words as values
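I have a unique use case. My main requirements are efficiency and speed. I have a list of 40,000 words and a dictionary of 250,000 entries in the format data: {id1: ['hi', 'how'], id2: ['I', 'love'], ..}. I have gone through a lot of questions here on SO, but couldn't find one that works for this.
How do I check whether each word in the word list exists in the list of words (the value) of each dictionary entry? Normally, one could do the following: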
all_words = get_vocabulary(data)
index = {}
for word in all_words:
    for doc, tokens in data.items():
        if word in tokens:
            # do something with key and tokens
            pass
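Doing this, I can check whether the word exists and do the rest of the work. However, my dictionary and list are large, and this takes a very long time.
If I have to go over a dictionary over and over again, that clearly signals a problem, as mentioned by @DeepSpace in this question.
Any help is much appreciated.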
You can create an index from the dictionary to speed up the search. For example:
all_words = ["word1", "word2"]
dct = {
    "id1": ["tis", "word1", "and", "word2"],
    "id2": ["word3", "word4"],
    "id3": ["word2", "only"],
}
# create index dictionary:
index_dct = {}
for k, v in dct.items():
    for word in v:
        index_dct.setdefault(word, []).append(k)
# index dictionary is:
# {
#     "tis": ["id1"],
#     "word1": ["id1"],
#     "and": ["id1"],
#     "word2": ["id1", "id3"],
#     "word3": ["id2"],
#     "word4": ["id2"],
#     "only": ["id3"],
# }
# now the search:
for word in all_words:
    if word in index_dct:
        for doc in index_dct[word]:
            print("Word: {} Doc: {} Tokens: {}".format(word, doc, dct[doc]))
This prints:
Word: word1 Doc: id1 Tokens: ['tis', 'word1', 'and', 'word2']
Word: word2 Doc: id1 Tokens: ['tis', 'word1', 'and', 'word2']
Word: word2 Doc: id3 Tokens: ['word2', 'only']
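If the goal is the literal reading of the title, that is, returning only the keys whose token list contains every word of a query list, the same index idea can be extended with set intersection. The following is a minimal sketch under that assumption, not part of the original answer; the helper names `build_index` and `keys_containing_all` are made up for illustration. Storing keys in sets also avoids duplicate entries when a word appears more than once in the same token list.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of keys whose token list contains it."""
    index = defaultdict(set)
    for key, tokens in docs.items():
        for word in tokens:
            index[word].add(key)
    return index


def keys_containing_all(index, words):
    """Return the set of keys whose token lists contain every word in `words`."""
    postings = []
    for word in words:
        # A word that is missing from the index cannot match any key.
        if word not in index:
            return set()
        postings.append(index[word])
    if not postings:
        # Assumption: an empty query matches nothing.
        return set()
    # Intersect starting from the smallest set to keep intermediate results small.
    postings.sort(key=len)
    result = set(postings[0])
    for keys in postings[1:]:
        result &= keys
        if not result:
            break
    return result


dct = {
    "id1": ["tis", "word1", "and", "word2"],
    "id2": ["word3", "word4"],
    "id3": ["word2", "only"],
}

index = build_index(dct)
print(sorted(keys_containing_all(index, ["word1", "word2"])))  # ['id1']
print(sorted(keys_containing_all(index, ["word2"])))           # ['id1', 'id3']
print(sorted(keys_containing_all(index, ["word1", "word3"])))  # []
```

The index is built once in a single pass over the dictionary; each query then costs only a few set lookups and intersections instead of a scan over all entries.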