How do I visualize two columns/lists of trigrams to see if the same word combination occurs in both columns/lists?

Question

I have two lists of trigrams (20 word combinations each), for example:

l1 = ('hello', 'its', 'me'), ('I', 'need', 'help') ...

l2 = ('I', 'need', 'help'), ('What', 'is', 'this') ...

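Now I want to visualize these two lists in a single chart (maybe a pair plot) to see whether they have anything in common (all 3 words must be the same).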

Thanks in advance.

Answer 1

INPUT:

l1 = [('hello', 'its', 'me'), ('I', 'need', 'help') ...]
l2 = [('I', 'need', 'help'), ('What', 'is', 'this') ...]

OUTPUT:

sim = [[('hello', 'its', 'me'), 1], [('I', 'need', 'help'), 2], [('What', 'is', 'this'), 1]]

merged = l1 + l2
unique = set(merged)
sim = []

# Count how often each distinct trigram occurs across both lists
for tri in unique:
    sim.append([tri, merged.count(tri)])

From your description, this seems to be what you are looking for. Let me know if it needs any adjustments.
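
As a possible alternative to calling count() once per trigram, here is a minimal sketch that uses collections.Counter and a set intersection to list the trigrams that occur in both lists. The two short sample lists are an assumption standing in for the elided data in the question, and the names counts_l1, counts_l2, shared and table are illustrative only.

```python
from collections import Counter

# Sample data (assumption: stands in for the elided lists in the question)
l1 = [('hello', 'its', 'me'), ('I', 'need', 'help')]
l2 = [('I', 'need', 'help'), ('What', 'is', 'this')]

# Count each trigram separately per list; tuples are hashable, so they work as keys
counts_l1 = Counter(l1)
counts_l2 = Counter(l2)

# Trigrams present in both lists (all three words identical)
shared = sorted(set(counts_l1) & set(counts_l2))

# One row per distinct trigram: (trigram, count in l1, count in l2)
table = [
    (tri, counts_l1[tri], counts_l2[tri])
    for tri in sorted(set(counts_l1) | set(counts_l2))
]

print(shared)
for row in table:
    print(row)
```

Keeping the counts per list, rather than merging first, makes it possible to tell a trigram that appears once in each list apart from one that appears twice in the same list.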

Answer 2

The answer given by Larry the Llama seems to miss the "see if there are similarities" part, because that solution uses set() to remove all duplicates.

If you want to find exactly matching trigrams with a full iteration:

merged = l1 + l2

results_counter = {}

# Iterate over all the trigrams
for index, trigram in enumerate(merged):
    key = str(trigram)
    # A trigram seen before was already fully counted at its first occurrence
    if key in results_counter:
        continue

    # Compare with this trigram and every trigram that comes after it in the list
    for second_index in range(index, len(merged)):
        all_same = True

        # Compare word by word and stop at the first mismatch
        for word_index, word in enumerate(trigram):
            if merged[second_index][word_index] != word:
                all_same = False
                break

        # Add one to the count when all three words matched
        if all_same:
            results_counter[key] = results_counter.get(key, 0) + 1

# Print each trigram together with its count
for key in results_counter:
    print(key, results_counter[key])

Edit after clarification:

import seaborn as sns
import pandas as pd

d1 = [("hello", "its", "me"), ("dont", "its", "me")]
d2 = [("hello", "its", "me"), ("Hello", "I", "dont")]

word_to_number = {} 
number_to_word = {}  # reverse mapping, if you want to show the sentence again
def one_hot(l):
    """
    Despite its name, this function integer-encodes the trigrams: each
    distinct word is mapped to a number (the same word always gets the
    same number). It returns the encoded list and also fills the
    converter dictionaries so the encoding can be reversed.
    """
    one_hot_encoded = []
    for trigram in l:
        encoded_trigram = []
        for word in trigram:
            # Look up the word's number, assigning a new one the first time it is seen
            encoded_word = word_to_number.setdefault(word, len(word_to_number))
            number_to_word[encoded_word] = word
            # Add the encoded word to the encoded trigram
            encoded_trigram.append(encoded_word)

        # Add the encoded trigram to the list that is returned
        one_hot_encoded.append(encoded_trigram)

    return one_hot_encoded

d1 = one_hot(d1)
d2 = one_hot(d2)

data = {}
for ind, trigram in enumerate(d1 + d2):
    # Each encoded trigram becomes one column, named t0, t1, ...
    data["t" + str(ind)] = trigram

frame = pd.DataFrame.from_dict(data)
print(frame)

plot = sns.pairplot(frame)
# Use the same limits on every axis so the panels are easy to compare
plot.set(ylim=(frame.min().min() - 1, frame.max().max() + 1))
plot.set(xlim=(frame.min().min() - 1, frame.max().max() + 1))

import matplotlib.pyplot as plt
plt.show()

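This gives you a pair plot of the trigrams, although it is not very intuitive, because you have to look for exact linear values: two trigrams are identical only when all three points in their panel lie exactly on the line y = x. You can use it, but make sure you do not have too many distinct words, since that stretches the axes and makes the result hard to read.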
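
If the pair plot turns out to be hard to read, one possible alternative (a different technique from the one above, not part of the original answer) is a match matrix: one row per trigram of the first list, one column per trigram of the second, and a filled cell wherever all three words agree. The sketch below is hedged accordingly. The function names match_matrix and plot_match_matrix and the sample lists are hypothetical, and matplotlib is imported inside the plotting function so the matrix itself can be built without it.

```python
def match_matrix(a, b):
    """Return a list of rows; entry [i][j] is 1 when a[i] == b[j], else 0.

    Tuples compare element-wise, so == is true only if all three words match.
    """
    return [[int(x == y) for y in b] for x in a]


def plot_match_matrix(a, b):
    """Draw the match matrix as a black-and-white grid and return the figure."""
    import matplotlib.pyplot as plt  # imported here so match_matrix works without it

    matrix = match_matrix(a, b)
    fig, ax = plt.subplots()
    ax.imshow(matrix, cmap="Greys", vmin=0, vmax=1)

    # Label every row and column with the words of its trigram
    ax.set_xticks(range(len(b)))
    ax.set_xticklabels([" ".join(t) for t in b], rotation=90)
    ax.set_yticks(range(len(a)))
    ax.set_yticklabels([" ".join(t) for t in a])
    ax.set_xlabel("second list")
    ax.set_ylabel("first list")
    fig.tight_layout()
    return fig


# Sample data (assumption: the same raw trigrams used in the answer above)
first = [("hello", "its", "me"), ("dont", "its", "me")]
second = [("hello", "its", "me"), ("Hello", "I", "dont")]

matrix = match_matrix(first, second)
print(matrix)
```

To see the chart, call plot_match_matrix(first, second) and then matplotlib.pyplot.show(). With 20 trigrams per list this gives a 20 x 20 grid, and the number of distinct words does not affect the axes.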

Tags: python, visualization, matplotlib, n-gram, seaborn