Comparing if Item in List(A) exists as Sub-Item in List(B)

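I have 2 lists, each a collection of strings, and I want to check whether an item of list(A) exists inside an item of list(B). list(A) holds the standard words and phrases that should be found in list(B). I filled list(A) with these (e.g. "innovation", "innovative", "new ways to go") and lemmatized them (['innovation'], ['innovative'], ['new', 'way', 'go']).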

list(B) holds the tokenized and lemmatized sentences of a text (e.g. ['time', 'new', 'way', 'go']).

With this setup I am trying to analyze whether, and how often, the given words and phrases occur in the text.

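To match the patterns, I read that each list element itself needs to be converted into a string, in order to check whether it is a substring of a string in list(b):
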
    list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
    list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'],  ['look', 'innovative', 'creative', 'people']]

    for x in range(len(list_a)):
        for j in range(len(list_b)):
            a = " ".join(list_a[x])
            if any(a in s for s in list_b[j]):
                print("word of list a: ", a, " appears in list b: ", list_b[j])

The actual output is:

word of list a:  innovation  appears in list b:  ['time', 'innovation']
word of list a:  innovative  appears in list b:  ['look', 'innovative', 'creative', 'people']

My target output is:

word of list a:  innovation  appears in list b:  ['time', 'innovation']
word of list a:  innovative  appears in list b:  ['look', 'innovative', 'creative', 'people']
word of list a: new way go appears in list b: ['time', 'go', 'new', 'way']

Converting the items of list(b) to strings, as I tried with list(a), did not help.

Thanks for your help!

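The first mistake is: don't create strings from your lists of words. Use sets of words and set methods (here: issubset).

  • Convert your lists of word lists into lists of word sets
  • Loop over the sets of the first list (a) and check whether each set is contained in one of the sets of list_b (without using any, otherwise we could not know which set contains the current set; a simple loop will do)

Like this: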

list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'],  ['look', 'innovative', 'creative', 'people']]

list_a = [set(x) for x in list_a]
list_b = [set(x) for x in list_b]

for subset in list_a:
    for other_subset in list_b:
        if subset.issubset(other_subset):
            print("{} appears in list b: {}".format(subset,other_subset))

This prints (the order of the elements inside each set may vary):

{'innovation'} appears in list b: {'time', 'innovation'}
{'innovative'} appears in list b: {'look', 'creative', 'innovative', 'people'}
{'new', 'go', 'way'} appears in list b: {'time', 'new', 'go', 'way'}

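Now, if you want to keep the order but still benefit from the fast membership testing of set, just create a list of (set, list) tuples for list_b, since it is iterated several times. There is no need to do the same for list_a, since it is iterated only once: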

# list_a is now unchanged
list_b = [(set(x),x) for x in list_b]

for sublist in list_a:
    subset = set(sublist)
    for other_subset,other_sublist in list_b:
        if subset.issubset(other_subset):
            print("{} appears in list b: {}".format(sublist,other_sublist))

Result:

['innovation'] appears in list b: ['time', 'innovation']
['innovative'] appears in list b: ['look', 'innovative', 'creative', 'people']
['new', 'way', 'go'] appears in list b: ['time', 'go', 'new', 'way']

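The algorithm is still costly: O(n**3), but not O(n**4), thanks to set lookup. Testing whether one list of words is contained in another takes O(n) with sets (each membership test is O(1) on average), compared to O(n**2) with plain lists.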
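
Since the question also asks how often each phrase occurs, the same subset test can feed a counter. This is only a sketch under assumptions: the helper name count_matches is made up for illustration, and it counts the number of sentences in list_b that contain all the words of a phrase, not repeated occurrences inside a single sentence.

```python
from collections import Counter

def count_matches(phrases, sentences):
    """Count, for each phrase, how many sentences contain all of its words.

    count_matches is a hypothetical helper name, not part of any library.
    """
    sentence_sets = [set(s) for s in sentences]  # build each set only once
    counts = Counter()
    for phrase in phrases:
        words = set(phrase)
        # Key by the joined phrase so the result is easy to read
        counts[" ".join(phrase)] = sum(words.issubset(s) for s in sentence_sets)
    return counts

list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'], ['look', 'innovative', 'creative', 'people']]

for phrase, n in count_matches(list_a, list_b).items():
    print(phrase, n)
```

Phrases that never match (here "set trend") stay in the result with a count of 0, which may be useful when reporting on the full list of search terms.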

Assuming you only want a match when all the words of one list in A are contained in one list in B, you can use the following.

list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'], ['look', 'innovative', 'creative', 'people'], ['way', 'go', 'time']]

for a_element in list_a:
    for b_element in list_b:
        for a_element_item in a_element:
            if a_element_item not in b_element:
                break
        else:
            print(a_element, "is in ", b_element)

Output:

['innovation'] is in  ['time', 'innovation']
['innovative'] is in  ['look', 'innovative', 'creative', 'people']
['new', 'way', 'go'] is in  ['time', 'go', 'new', 'way']
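
The for/else loop above can also be written with the built-in all(), which some readers may find easier to follow. A minimal sketch that collects the matches in a list first (the name matches is chosen here purely for illustration) instead of printing them directly:

```python
list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'], ['look', 'innovative', 'creative', 'people'], ['way', 'go', 'time']]

# all() is True only if every word of a_element occurs in b_element
matches = [
    (a_element, b_element)
    for a_element in list_a
    for b_element in list_b
    if all(word in b_element for word in a_element)
]

for a_element, b_element in matches:
    print(a_element, "is in ", b_element)
```

Like the loop version, this tests membership against plain lists, so for long sentences converting each b_element to a set first would likely be faster.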