Remove duplicates from a tuple

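I am trying to extract keywords from a text. Using the "en_core_sci_lg" model, I get a tuple of phrases/words that contains some duplicates, which I am trying to remove. I tried the usual deduplication approaches for lists and tuples, but none of them worked. Can anyone help? Thanks a lot.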

text = """spaCy is an open-source software library for advanced natural language processing,
written in the programming languages Python and Cython. The MIT library is published under the MIT license and its main developers are Matthew Honnibal and Ines Honnibal, the founders of the software company Explosion."""

The code I tried:

import spacy
nlp = spacy.load("en_core_sci_lg")

doc = nlp(text)
my_tuple = list(set(doc.ents))
print('original tuple', doc.ents, len(doc.ents))
print('after set function', my_tuple, len(my_tuple))

Output:

original tuple: (spaCy, open-source software library, programming languages, Python, Cython, MIT, library, published, MIT, license, developers, Matthew Honnibal, Ines, Honnibal, founders, software company Explosion) 16

after set function: [Honnibal, MIT, Ines, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, MIT, published, open-source software library, spaCy] 16

The expected output is (there should be only one MIT, and the name Ines Honnibal should stay together):

[Ines Honnibal, MIT, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, published, open-source software library, spaCy]

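doc.ents is not a list of strings. It is a tuple of Span objects. When you print one, it prints its text, but they really are distinct objects (the two MIT entities sit at different positions in the document), which is why set does not see them as duplicates. The clue is that there are no quotes in your printed output. If these were strings, you would see quotes.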
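
The effect can be reproduced without loading a model. The sketch below uses a hypothetical stand-in class, FakeSpan, which is not spaCy's Span; it only assumes that an entity carries its text plus a position and prints as bare text. The positions are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeSpan:
    """Hypothetical stand-in for an entity span: text plus its position."""
    text: str
    start: int
    end: int

    def __repr__(self):
        # Print only the text, without quotes, as in the question's output.
        return self.text


# Two "MIT" entities at different (made-up) positions, plus one other entity.
ents = (FakeSpan("MIT", 20, 21), FakeSpan("library", 21, 22), FakeSpan("MIT", 25, 26))

print(ents)                         # looks like duplicates, no quotes shown
print(len(set(ents)))               # the objects differ by position
print(len({e.text for e in ents}))  # the strings do not
```

The set of objects keeps all three entries, while the set of their texts keeps two.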

To remove the duplicates, compare the entities by their text rather than as Span objects. Note that the result contains plain strings, not Span objects:

my_tuple = list(set(e.text for e in doc.ents))
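
A set does not keep the original order of the entities. If the order matters, one possible variant is to deduplicate through dictionary keys, which stay in insertion order in Python 3.7+. This is a minimal sketch: unique_entity_texts is a hypothetical helper name, and the SimpleNamespace objects merely stand in for anything with a .text attribute, such as the items of doc.ents.

```python
from types import SimpleNamespace


def unique_entity_texts(ents):
    """Return the entity texts without duplicates, in first-seen order."""
    # Dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(e.text for e in ents))


# Stand-in objects with a .text attribute (assumption: real code passes doc.ents).
sample = [SimpleNamespace(text=t) for t in ("spaCy", "MIT", "library", "MIT", "license")]

result = unique_entity_texts(sample)
print(result)
```

With a loaded model, the call would be unique_entity_texts(doc.ents). This only removes repeated texts; it does not join entities that the model split apart, such as Ines and Honnibal.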