Obtaining the index of a word between two columns in pandas

Question

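I am checking which words the spaCy Spanish lemmatizer can handle, using the .has_vector attribute. A DataFrame holds two columns: one with the function's output, indicating which words can be lemmatized, and the other with the corresponding phrase.

I would like to know how to extract all the words whose output is False, so that I can correct them and they can be lemmatized.

So I created this function: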

def lemmatizer(text):
    # Join each token's has_vector flag into one space-separated string.
    doc = nlp(text)
    return ' '.join([str(word.has_vector) for word in doc])

and applied it to the reviews column of the DataFrame:

df["Vectors"] = df.reviews.apply(lemmatizer)

and put the two columns into another DataFrame:

df2 = pd.DataFrame(df[['Vectors', 'reviews']])

The output is:

index             Vectors              reviews
  1     True True True False        'La pelicula es aburridora'

Answer 1

There are two ways to do this. Both use the following setup:

import pandas
import spacy

nlp = spacy.load('en_core_web_lg')
df = pandas.DataFrame({'reviews': ["aaabbbcccc some example words xxxxyyyz"]})

If you want to use has_vector:

def get_oov1(text):
    return [word.text for word in nlp(text) if not word.has_vector]

Or you can use the is_oov attribute:

def get_oov2(text):
    return [word.text for word in nlp(text) if word.is_oov]

Then apply them just as you already did:

df["oov_words1"] = df.reviews.apply(get_oov1)
df["oov_words2"] = df.reviews.apply(get_oov2)

Which returns:

                                  reviews              oov_words1              oov_words2
0  aaabbbcccc some example words xxxxyyyz  [aaabbbcccc, xxxxyyyz]  [aaabbbcccc, xxxxyyyz]
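
If the Vectors column from the question already exists, the flags can also be paired back with the words without running the pipeline again. The following is only a minimal sketch: `false_flag_words` is a hypothetical helper name, and it assumes that splitting the review on whitespace gives the same tokens spaCy produced, which will not hold once punctuation is split off into separate tokens.

```python
def false_flag_words(flags, text):
    """Return (position, word) pairs for every word whose flag is "False".

    Assumption: `flags` is the space-joined string built by lemmatizer()
    and `text.split()` yields the same tokens that spaCy produced.
    """
    flag_list = flags.split()
    words = text.split()
    if len(flag_list) != len(words):
        # spaCy tokenized the text differently (for example around
        # punctuation), so positions can no longer be matched up safely.
        raise ValueError("flag count does not match word count")
    return [(i, word)
            for i, (word, flag) in enumerate(zip(words, flag_list))
            if flag == "False"]


print(false_flag_words("True True True False", "La pelicula es aburridora"))
# [(3, 'aburridora')]
```

On the DataFrame from the question this could be applied row by row, for example with `df2.apply(lambda row: false_flag_words(row["Vectors"], row["reviews"]), axis=1)`. Recomputing from the tokens, as in the two functions above, stays the more robust option because it does not depend on the two tokenizations matching.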

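Note:

When using either approach, it is important to know that the result is model dependent: smaller models usually have no word vectors backing these attributes, so they always return a default value!

This means that running exactly the same code with, for example, en_core_web_sm gives this: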

                                  reviews  oov_words1                                    oov_words2
0  aaabbbcccc some example words xxxxyyyz          []  [aaabbbcccc, some, example, words, xxxxyyyz]

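This is because the small model ships without a word-vector table, so neither attribute is set from a real vocabulary lookup and each comes back with the same fallback value for every token. In the output above, has_vector is True for every token, so the has_vector check wrongly shows all words as known, while is_oov is also True for every token, so the is_oov check wrongly shows all words as unknown.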
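
One way to avoid being misled by a model without vectors is to check the vector table before trusting either attribute. This is a sketch under assumptions rather than a definitive recipe: `get_oov_checked` is a hypothetical helper name, and it assumes spaCy 3, where `nlp.vocab.vectors.size` is 0 for pipelines that ship without static vectors.

```python
def get_oov_checked(nlp, text):
    """Return out-of-vocabulary words, refusing to run without vectors.

    Assumption: `nlp` is a loaded spaCy pipeline. In spaCy 3,
    `nlp.vocab.vectors.size` is 0 when the pipeline has no static
    vectors, in which case is_oov is True for every token.
    """
    if nlp.vocab.vectors.size == 0:
        raise ValueError(
            "this pipeline has no word vectors; is_oov and has_vector "
            "are not meaningful, load a md or lg model instead"
        )
    return [token.text for token in nlp(text) if token.is_oov]


# Example usage (requires a model with vectors to be installed):
#   import spacy
#   nlp = spacy.load("es_core_news_lg")
#   df["oov_words"] = df.reviews.apply(lambda text: get_oov_checked(nlp, text))
```

Raising an error rather than returning an empty list keeps a silent "everything is known" or "everything is unknown" result from reaching the DataFrame.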
