How do I visualize two columns/lists of trigrams to see whether the same word combinations occur in both columns/lists?

So I have two lists of trigrams (20 word combinations each), for example:

l1 = ('hello', 'its', 'me'), ('I', 'need', 'help') ...

l2 = ('I', 'need', 'help'), ('What', 'is', 'this') ...

Now I want to visualize these two lists in one chart (maybe a pair plot) to see whether there are any matches (all 3 words must be the same).

Thanks in advance.

INPUT:

l1 = [('hello', 'its', 'me'), ('I', 'need', 'help') ...]
l2 = [('I', 'need', 'help'), ('What', 'is', 'this') ...]

OUTPUT:

sim = [[('hello', 'its', 'me'), 1], [('I', 'need', 'help'), 2], [('What', 'is', 'this'), 1]]
merged = l1 + l2
unique = set(merged)
results = []

for tri in unique:
    results.append([tri, merged.count(tri)])

From your description, this seems to be what you are looking for. Let me know if it needs any adjustments.
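A merged count cannot tell a trigram that appears in both lists apart from one that is repeated within a single list. As a minimal sketch, assuming the trigrams are tuples (and therefore hashable), counting each list separately makes that distinction explicit. The sample data and the names `counts1`, `counts2`, and `in_both` are illustrative only and are not part of the answer above:

```python
from collections import Counter

# Sample data for illustration only
l1 = [('hello', 'its', 'me'), ('I', 'need', 'help')]
l2 = [('I', 'need', 'help'), ('What', 'is', 'this')]

# Count occurrences separately for each list
counts1 = Counter(l1)
counts2 = Counter(l2)

# Trigrams present in both lists (tuples are equal only if all three words match)
in_both = sorted(set(counts1) & set(counts2))

for tri in in_both:
    print(tri, counts1[tri], counts2[tri])
```

Keeping the two counters separate also shows how often a shared trigram occurs in each list, which a single merged count hides.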

The answer by Larry the Llama seems to miss the "see whether there are any matches" part, because that solution uses set() to remove all duplicates.

If you want to find exactly matching trigrams through a full iteration:

merged = l1 + l2

results_counter = {}

# Iterate over all the trigrams
for trigram in merged:
    # Count each distinct trigram only once
    if str(trigram) in results_counter:
        continue

    # Compare against every trigram in the merged list
    for candidate in merged:
        all_same = True

        # Two trigrams match only if every word is the same
        for word_index, word in enumerate(trigram):
            if candidate[word_index] != word:
                all_same = False
                break

        if all_same:
            # Add the key with a count of 0 if it is missing, then add one
            results_counter.setdefault(str(trigram), 0)
            results_counter[str(trigram)] += 1

# Print the count for each trigram
for key in results_counter:
    print(key, results_counter[key])

Edit after clarification:

import seaborn as sns
import pandas as pd

d1 = [("hello", "its", "me"), ("dont", "its", "me")]
d2 = [("hello", "its", "me"), ("Hello", "I", "dont")]

word_to_number = {} 
number_to_word = {} # if you want to show the sentence again
def one_hot(l):
    """
    Encode each word as an integer (every distinct word gets its own
    number) and return the encoded list, while also adding the words
    to the converter dictionaries for reverse conversion.
    """
    one_hot_encoded = []
    for trigram in l:
        encoded_trigram = []
        for word in trigram:
            # Add encoding of the word
            encoded_word = word_to_number.setdefault(word, len(word_to_number))
            number_to_word[encoded_word] = word
            # Add the encoded word to the encoded trigram
            encoded_trigram.append(encoded_word)

        # Add the encoded trigram to the list that is returned
        one_hot_encoded.append(encoded_trigram)

    return one_hot_encoded

d1 = one_hot(d1)
d2 = one_hot(d2)

data = {}
for ind, trigram in enumerate(d1 + d2):
    # This will add each word to be compared
    data["t" + str(ind)] = trigram

frame = pd.DataFrame.from_dict(data)
print(frame)

plot = sns.pairplot(frame)
# Make it clear
plot.set(ylim=(frame.min().min() - 1, frame.max().max() + 1))
plot.set(xlim=(frame.min().min() - 1, frame.max().max() + 1))

import matplotlib.pyplot as plt
plt.show()

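This will give you a pair plot of the trigrams, although it is not very intuitive, because you have to look for exactly linear values. You can use it, but make sure you don't have too many distinct words, because that will distort the axes and make the result hard to read visually.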
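If the goal is only to see which trigrams occur in both lists, a presence grid may be easier to read than a pair plot. The following is a rough sketch and not the approach used above: it builds one row per distinct trigram and one column per list, then draws the grid with matplotlib's `imshow`. The helper names `presence_rows` and `plot_presence` are made up for this example, and the sample data is illustrative only:

```python
def presence_rows(l1, l2):
    """Return (labels, rows) with one row [in_l1, in_l2] per distinct trigram."""
    s1, s2 = set(l1), set(l2)
    trigrams = sorted(s1 | s2)
    labels = [" ".join(t) for t in trigrams]
    rows = [[int(t in s1), int(t in s2)] for t in trigrams]
    return labels, rows


def plot_presence(labels, rows):
    """Draw the presence grid; a row with both cells filled is a shared trigram."""
    # Imported here so presence_rows can be used without a plotting library
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.imshow(rows, cmap="Greys", aspect="auto", vmin=0, vmax=1)
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["l1", "l2"])
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    plt.show()


# Sample data for illustration only
l1 = [("hello", "its", "me"), ("I", "need", "help")]
l2 = [("I", "need", "help"), ("What", "is", "this")]

labels, rows = presence_rows(l1, l2)
# plot_presence(labels, rows)  # uncomment to display the chart
```

With 20 trigrams per list the grid has at most 40 rows, so the labels should stay readable, and the axes do not depend on how many distinct words there are.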