Iterating over tokens within lists within lists using for-loops in Python (spaCy)

Question

I'm fairly new to this, so I may be making a very basic mistake, but as I understand it, you iterate over tokens in lists within a list in Python like this:

for each_list in full_list:
  for each_token in each_list:
    ...  # do whatever you want to do

However, when using spaCy, the first for loop seems to iterate over tokens rather than over lists.

So the code:

for eachlist in alice:
  if len(eachlist) > 5:
    print(eachlist)

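(where alice is a list of lists, and each list is a sentence containing tokenized words)

actually prints every word longer than 5 letters, rather than every sentence longer than 5 words (which is what it should do if it really were in the "first level" for loop).

The code: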

newalice = []
for eachlist in alice:
  for eachword in eachlist:
    #make a new list of lists where each list contains only words that are classified as nouns, adjectives, or verbs (with a few more specific stipulations)
    if (eachword.pos_ == 'NOUN' or eachword.pos_ == 'VERB' or eachword.pos_ == 'ADJ') and (eachword.dep_ != 'aux') and (eachword.dep_ != 'conj'):
        newalice.append([eachword])

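returns the error: "TypeError: 'spacy.tokens.token.Token' object is not iterable."

The reason I want to do this in a nested for loop is that I want newalice to be a list of lists (I still want to be able to iterate over sentences; I just want to get rid of the words I don't care about).

I don't know whether I'm making some really basic mistake in my code or whether spaCy is doing something odd, but either way I'd really appreciate any help on how to iterate over the items in a list within a list in spaCy while keeping the original list structure intact.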

Answer 1

Here is code for iterating over the elements of a nested list:

list_inst = [ ["this", " ", "is", " ", "a", " ", "sentence"], ["another", " ", "one"]]
for sentence in list_inst:
    for token in sentence:
        print(token, end="")
    print("")

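I think your misunderstanding comes from the fact that each sentence in spaCy is not stored in a list but in a Doc object. A Doc object is iterable and contains tokens, but it also carries some extra information.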
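As a rough illustration of why the `len` check in the question prints long words rather than long sentences, here is a minimal sketch that needs no spaCy. `FakeToken` is a hypothetical stand-in, not a spaCy class: it only mimics the fact that a token has a length (its character count) but cannot be iterated. If the outer loop actually runs over a flat sequence of tokens, you would expect both symptoms described in the question.

```python
class FakeToken:
    """Hypothetical stand-in for a token: it has a length but is not iterable."""

    def __init__(self, text):
        self.text = text

    def __len__(self):
        # Length of a token is its number of characters, not a word count.
        return len(self.text)


# A flat sequence of tokens, i.e. NOT a list of lists.
flat = [FakeToken(w) for w in "Alice was beginning to get very tired".split()]

# Symptom 1: the length check selects long words, not long sentences.
long_items = [tok.text for tok in flat if len(tok) > 5]
print(long_items)

# Symptom 2: a nested loop fails, because a single token is not iterable.
raised = False
try:
    for tok in flat:
        for piece in tok:
            pass
except TypeError as err:
    raised = True
    print("TypeError:", err)
```

A quick way to check the real data is to print `type(alice)` and the type of its first element before writing the loops.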

Example code:

# iterate over sentences after spaCy preprocessing
import spacy
nlp = spacy.load('en_core_web_sm')
doc1 = nlp("this is a sentence")
doc2 = nlp("another one")
list_inst = [doc1, doc2]
for doc in list_inst:
    for token in doc:
        print(token, end=" ")
    print("")

The output is the same (apart from a trailing space on each line).
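For the filtering part of the question, one possible approach is to build one inner list per sentence, so the result stays a list of lists. The sketch below is only an illustration under assumptions: `Tok` is a hypothetical stand-in that carries just the `pos_` and `dep_` attributes used in the question, and `filter_sentences`, `KEEP_POS` and `SKIP_DEP` are names invented here. With real spaCy data, `sentences` could be a list of Doc objects such as `list_inst` above, or `doc.sents` if the pipeline sets sentence boundaries.

```python
from collections import namedtuple

# Hypothetical stand-in for a token: only the attributes used in the question.
Tok = namedtuple("Tok", ["text", "pos_", "dep_"])

KEEP_POS = {"NOUN", "VERB", "ADJ"}  # parts of speech to keep
SKIP_DEP = {"aux", "conj"}          # dependency labels to drop


def filter_sentences(sentences):
    """Return a list of lists: one inner list of kept tokens per sentence."""
    return [
        [tok for tok in sent if tok.pos_ in KEEP_POS and tok.dep_ not in SKIP_DEP]
        for sent in sentences
    ]


sentences = [
    [
        Tok("Alice", "PROPN", "nsubj"),
        Tok("was", "AUX", "aux"),
        Tok("reading", "VERB", "ROOT"),
        Tok("books", "NOUN", "dobj"),
    ],
    [Tok("another", "DET", "det"), Tok("one", "NOUN", "ROOT")],
]

filtered = filter_sentences(sentences)
texts = [[tok.text for tok in sent] for sent in filtered]
print(texts)
```

The difference from the code in the question is that a new inner list is created once per sentence, rather than wrapping every single kept word in its own list.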

Hope this helps!

python

for-loop

spacy