Flashtext keyword extraction is returning NaN at the end of the dataframe

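The KeywordProcessor from FlashText, which I use to extract keywords, returns NaN at the end of the dataframe. The dataframe has shape (14532885, 6), and only one of its columns (the one containing sentences) is used for keyword extraction.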

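Keyword extraction is applied correctly up to row 14452474. In other words, the extraction is not applied to the last 80411 rows of the sentence column.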

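from flashtext import KeywordProcessor

# keywords_dict (defined elsewhere) maps each clean name to a list of keyword variants
kp = KeywordProcessor()
kp.add_keywords_from_dict(keyword_dict=keywords_dict)

# Extract the keywords of every sentence into a new column
df['keywords'] = df['text'].apply(lambda sentence: kp.extract_keywords(sentence=sentence, span_info=False))

# Inspect the rows around the point where NaN starts
df[['text', 'keywords']][14452474:14452480]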

output:
            text                                                keywords
14452474    it is monsoon season in stl today rain rain r...    [friendly]
14452475    hahahah pidgeons then                               []
14452476    nothing planned maybe ill go stay with u and h...   []
14452477    he wont disappoint                                  NaN
14452478    hi doc dickerson howdy opened a new twitter ac...   NaN
14452479    only one more class left for today then im hom...   NaN
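
To confirm exactly where the missing values begin and how many there are, a small diagnostic may help. The following is only a sketch: `summarize_missing` and the toy dataframe are hypothetical names introduced for illustration, and it assumes the column is called `keywords` as in the snippet above. It also reports whether the index is unique, which is worth checking but is not known to be the cause here.

```python
import pandas as pd

def summarize_missing(df, column="keywords"):
    """Count the NaN entries in `column` and report the label of the first one."""
    missing = df[column].isna()  # lists (even empty ones) are not NaN
    first_missing = missing.idxmax() if missing.any() else None
    return {
        "n_missing": int(missing.sum()),
        "first_missing_label": first_missing,
        "index_is_unique": df.index.is_unique,
    }

# Toy data standing in for the real 14.5M-row dataframe
toy = pd.DataFrame({
    "text": ["rain rain", "pidgeons", "he wont disappoint", "hi doc"],
    "keywords": [["friendly"], [], float("nan"), float("nan")],
})
report = summarize_missing(toy)
print(report)
```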

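A temporary workaround is to apply the same function again in a second column, restricted to the rows that came back as NaN. The two columns are then combined into a third, new column, and the first two are dropped because each of them contains NaN values.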

# First pass over the whole column; the trailing rows come back as NaN
df['keywords_1'] = df['text'].apply(lambda sentence: kp.extract_keywords(sentence=sentence, span_info=False))

# Second pass over only the rows that were NaN after the first pass
df['keywords_2'] = df['text'][14452477:].apply(lambda sentence: kp.extract_keywords(sentence=sentence, span_info=False))

# Prefer the first pass and fall back to the second pass where it is NaN
df['keyword'] = df['keywords_1'].combine_first(df['keywords_2'])

# drop() returns a new dataframe, so the result has to be assigned back
df = df.drop(columns=['keywords_1', 'keywords_2'])
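
The same workaround can be written without a hard-coded row number by recomputing only the rows that are still NaN. This is a sketch, not a fix for the underlying cause: `fill_missing_keywords` is a hypothetical helper, and `fake_extract` is a stand-in for `kp.extract_keywords` so the example runs without FlashText installed.

```python
import pandas as pd

def fill_missing_keywords(df, extract, text_col="text", out_col="keywords"):
    """Return a copy of df where NaN entries of out_col are recomputed from text_col."""
    result = df.copy()
    mask = result[out_col].isna()
    # Re-run the extractor only on the rows that are still missing
    refilled = result.loc[mask, text_col].apply(extract)
    # Keep existing values and fall back to the recomputed ones where NaN
    result[out_col] = result[out_col].combine_first(refilled)
    return result

def fake_extract(sentence):
    # Stand-in for kp.extract_keywords: keeps only the word "good"
    return [word for word in sentence.split() if word == "good"]

toy = pd.DataFrame({
    "text": ["good day", "nothing here", "good good"],
    "keywords": [["good"], float("nan"), float("nan")],
})
fixed = fill_missing_keywords(toy, fake_extract)
print(fixed["keywords"].tolist())
```

With the real data, `extract` would be something like `lambda s: kp.extract_keywords(s)`, using the `kp` object from the question.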