How do I visualize two columns/lists of trigrams to see whether the same word combinations occur in both columns/lists?
So I have two lists of trigrams (20 word combinations each), for example:
l1 = ('hello', 'its', 'me'), ('I', 'need', 'help') ...
l2 = ('I', 'need', 'help'), ('What', 'is', 'this') ...
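Now I want to visualize these two lists in one chart (maybe a pair plot) to see whether there are any similarities (all 3 words must be the same).
Thanks in advance.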
INPUT:
l1 = [('hello', 'its', 'me'), ('I', 'need', 'help') ...]
l2 = [('I', 'need', 'help'), ('What', 'is', 'this') ...]
OUTPUT:
sim = [[('hello', 'its', 'me'), 1], [('I', 'need', 'help'), 2], [('What', 'is', 'this'), 1]]
merged = l1 + l2
unique = set(merged)
results = []
for tri in unique:
    results.append([tri, merged.count(tri)])
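From your description, this seems to be what you are looking for. Let me know if it needs any adjustments.
The answer given by Larry the Llama seems to miss the "see whether there are any similarities" part, because that solution uses set() to remove all duplicates.
If you want to find exactly matching trigrams through a full iteration: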
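If it also matters which list each occurrence came from, one possible variation is to count the two lists separately with collections.Counter. This is only a sketch: the helper name count_trigrams and the variables counts and shared are made up for illustration, and the data is the shortened example from the question rather than the full 20-trigram lists.

```python
from collections import Counter

# Example data taken from the question; the real lists hold 20 trigrams each.
l1 = [('hello', 'its', 'me'), ('I', 'need', 'help')]
l2 = [('I', 'need', 'help'), ('What', 'is', 'this')]

def count_trigrams(first, second):
    """Return {trigram: (count in first, count in second)} for every trigram."""
    c1, c2 = Counter(first), Counter(second)
    # A Counter returns 0 for missing keys, so no special-casing is needed
    return {tri: (c1[tri], c2[tri]) for tri in c1.keys() | c2.keys()}

counts = count_trigrams(l1, l2)
# A trigram is shared only if it occurs at least once in each list
shared = sorted(tri for tri, (a, b) in counts.items() if a and b)
print(shared)  # [('I', 'need', 'help')]
```

Because tuples compare element by element, a trigram only counts as shared when all 3 words are identical, including case.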
merged = l1 + l2
results_counter = {}
# Iterate over all the trigrams
for index, trigram in enumerate(merged):
    # Skip trigrams that were already counted at an earlier position
    if str(trigram) in results_counter:
        continue
    # Compare against this trigram and every trigram after it in the list
    for second_index in range(index, len(merged)):
        all_same = True
        # Compare word by word; all three words must match
        for word_index, word in enumerate(trigram):
            if merged[second_index][word_index] != word:
                all_same = False
                break
        if all_same:
            # Add the key with a count of 0 if it is not in results_counter yet
            results_counter.setdefault(str(trigram), 0)
            # Add one for this exact match
            results_counter[str(trigram)] += 1
# Print the count for each trigram
for key in results_counter.keys():
    print(key, results_counter[key])
Edit after clarification:
import seaborn as sns
import pandas as pd
d1 = [("hello", "its", "me"), ("dont", "its", "me")]
d2 = [("hello", "its", "me"), ("Hello", "I", "dont")]
word_to_number = {}
number_to_word = {} # if you want to show the sentence again
def one_hot(l):
    """
    This function one hot encodes (converts each appearance of a word
    to a number) and returns the encoded list while also adding the
    keys to converter dictionaries for reverse converting.
    """
    one_hot_encoded = []
    for trigram in l:
        encoded_trigram = []
        for word in trigram:
            # Add encoding of the word
            encoded_word = word_to_number.setdefault(word, len(word_to_number))
            number_to_word[encoded_word] = word
            # Add to the encoded trigram
            encoded_trigram.append(encoded_word)
        # Add to the list that is returned
        one_hot_encoded.append(encoded_trigram)
    return one_hot_encoded
d1 = one_hot(d1)
d2 = one_hot(d2)
data = {}
for ind, trigram in enumerate(d1 + d2):
    # This will add each trigram as a column to be compared
    data["t" + str(ind)] = trigram
frame = pd.DataFrame.from_dict(data)
print(frame)
plot = sns.pairplot(frame)
# Make it clear
plot.set(ylim=(frame.min().min() - 1, frame.max().max() + 1))
plot.set(xlim=(frame.min().min() - 1, frame.max().max() + 1))
import matplotlib.pyplot as plt
plt.show()
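This will give you a pair plot of the trigrams, although it is not very intuitive, because you have to look for exact linear values. You can use it, but make sure you do not have too many different words, as that will distort the axes and make the result hard to read visually.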
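If the pair plot turns out to be too hard to read, a different technique that might fit the question better is a simple presence grid: one row per unique trigram, one column per list, and a filled cell where the trigram occurs. The sketch below is an assumption about what would be useful, not part of the answer above; presence_matrix and plot_presence are hypothetical helper names, and the plotting part assumes matplotlib is installed.

```python
def presence_matrix(first, second):
    """Return (labels, rows): one row [in_first, in_second] per unique trigram."""
    # dict.fromkeys keeps first-seen order while dropping duplicates
    trigrams = list(dict.fromkeys(list(first) + list(second)))
    in_first, in_second = set(first), set(second)
    labels = [" ".join(tri) for tri in trigrams]
    rows = [[int(tri in in_first), int(tri in in_second)] for tri in trigrams]
    return labels, rows

def plot_presence(first, second):
    """Draw the presence matrix as a two-column grid (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    labels, rows = presence_matrix(first, second)
    fig, ax = plt.subplots()
    # Dark cell = trigram present in that list, light cell = absent
    ax.imshow(rows, cmap="Greys", aspect="auto", vmin=0, vmax=1)
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["l1", "l2"])
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    plt.show()

# Shortened example data from the question
l1 = [('hello', 'its', 'me'), ('I', 'need', 'help')]
l2 = [('I', 'need', 'help'), ('What', 'is', 'this')]
labels, rows = presence_matrix(l1, l2)
for label, row in zip(labels, rows):
    print(label, row)
```

Calling plot_presence(l1, l2) would then draw the grid; any row with both cells filled is a trigram that appears in both lists.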