Flashtext keyword extraction is returning NaN at the end of the dataframe
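The KeywordProcessor used to extract keywords with FlashText returns NaN at the end of the dataframe. The dataframe has shape (14,532,885, 6), and only one column (the one containing sentences) is used for keyword extraction.
The keyword extraction is applied correctly up to row 14,452,474. In other words, the extraction is not applied to the last 80,411 rows of the sentence column.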
from flashtext import KeywordProcessor
kp = KeywordProcessor()
kp.add_keywords_from_dict(keyword_dict=keywords_dict)
df['keywords'] = df['text'].apply(lambda sentence: kp.extract_keywords(sentence=sentence, span_info=False))
df[['text', 'keywords']][14452474:14452480]
Output:
          text                                                keywords
14452474  it is monsoon season in stl today rain rain r...  [friendly]
14452475  hahahah pidgeons then                               []
14452476  nothing planned maybe ill go stay with u and h...  []
14452477  he wont disappoint                                  NaN
14452478  hi doc dickerson howdy opened a new twitter ac...  NaN
14452479  only one more class left for today then im hom...  NaN
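A temporary workaround is to create a second column by applying the same function to the rows that came back as pure NaN, then combine the two columns into a third, new column, and finally drop the first two columns, since each of them contains NaN values.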
df['keywords_1'] = df['text'].apply(lambda sentence: kp.extract_keywords(sentence=sentence, span_info=False))
df['keywords_2'] = df['text'][14452477:].apply(lambda sentence: kp.extract_keywords(sentence=sentence, span_info=False))
df['keywords_result'] = df['text'][14452477:].apply(lambda x: kp.extract_keywords(x))
df['keyword'] = df['keywords_1'].combine_first(df['keywords_2'])
df = df.drop(columns=['keywords_1', 'keywords_2'])  # drop() returns a new frame, so assign it back
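To make the combine step easier to follow, here is a minimal sketch of the same workaround on a four-row toy dataframe. It does not reproduce the NaN problem itself; the missing tail is simulated by applying the function to only the first two rows. The `extract_keywords` function and its tiny vocabulary are hypothetical stand-ins for `kp.extract_keywords`, so the snippet runs without FlashText.

```python
import pandas as pd


def extract_keywords(sentence):
    # Hypothetical stand-in for kp.extract_keywords: returns the words of
    # the sentence that appear in a small, made-up vocabulary.
    vocabulary = {"rain", "twitter"}
    return [word for word in sentence.split() if word in vocabulary]


df = pd.DataFrame({
    "text": [
        "rain rain today",
        "hahahah pidgeons then",
        "he wont disappoint",
        "opened a new twitter account",
    ]
})

# Simulate the symptom: only the first two rows get a result, so index
# alignment leaves NaN in the remaining rows of the new column.
df["keywords_1"] = df["text"].iloc[:2].apply(extract_keywords)

# Second pass over the rows that were left as NaN.
df["keywords_2"] = df["text"].iloc[2:].apply(extract_keywords)

# combine_first keeps keywords_1 where it is not NaN and fills the gaps
# from keywords_2. Empty lists are real values, so they are preserved.
df["keyword"] = df["keywords_1"].combine_first(df["keywords_2"])

# drop() returns a new dataframe, so the result is assigned back.
df = df.drop(columns=["keywords_1", "keywords_2"])

print(df)
```

Note that rows with no match keep their empty list `[]` rather than being treated as missing, which is why `combine_first` only fills the rows that were genuinely NaN.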