pandas explode producing unexpected results

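I am trying to explode one column of a DataFrame into multiple rows. The column to explode is called keywords, and it holds the list of emotions that the FlashText package returns as extracted keywords. In other words, if a keyword occurs in the text column (the column holding the sentences), FlashText returns the emotion or emotions that correspond to that sentence.

With the sample DataFrame I created, this works perfectly and gives the expected output, but when I apply explode to the real DataFrame it returns what looks like a random combination of rows.

I thought this unexpected result was caused by the DataFrame having duplicate index values; however, dropping them gave the same wrong result.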
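To make the duplicate-index suspicion concrete, here is a minimal sketch on toy data (not the real frame) of how rows get mixed when exploded values are matched back to the other columns by index label. That the pandas version in use rejoins by index label is an assumption, not something confirmed here; the sketch only shows what an index-based join does when labels repeat, and that a fresh unique index avoids it.

```python
import pandas as pd

# Toy frame with a repeated index label, as can happen after pd.concat
df = pd.DataFrame(
    {
        "text": ["i miss kenny powers", "best confident"],
        "keywords": [[], ["confident"]],
    },
    index=[0, 0],
)

# Explode the list column on its own, then join it back by index label.
# Both sides carry label 0 twice, so the join pairs every left row with
# every right row that shares the label.
exploded = df["keywords"].explode()
joined = df.drop(columns="keywords").join(exploded)
print(joined)

# With a unique index, each sentence keeps only its own keywords
fixed = df.reset_index(drop=True).explode("keywords")
print(fixed)
```

In the joined frame the sentence without keywords ends up next to a keyword that belongs to the other row, which resembles the symptom described above.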

Expected output

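import pandas as pd
from flashtext import KeywordProcessor

# keywords_dict (defined elsewhere) maps each emotion to the list of words that signal it
kp = KeywordProcessor()
kp.add_keywords_from_dict(keyword_dict=keywords_dict)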


test_df = pd.DataFrame({
    'text': [
        'I really hate and love love everyone best confident shy',
        'i should be sleeping i have a stressed out week coming to me',
        'late night snack glass of oj bc im quotdown with the sicknessquot then back to sleepugh i hate getting sick',

        # The next four sentences contain no keywords, so extraction returns an empty list
        'whatever',
        '[]',
        'body of missing northern calif girl found poli',
        'i miss kenny powers',

        'sorry  tell them mea culpa from me and that i really am sorry',
    ]
})

# Extracting keywords
test_df['keywords'] = test_df['text'].apply(lambda x: kp.extract_keywords(x, span_info=False))

# Exploding keywords column into rows, then resetting to a fresh unique index
test_df = test_df.explode('keywords').reset_index(drop=True)

# Transforming NaN (produced by exploding an empty list) back into an empty list
test_df['keywords'] = test_df['keywords'].fillna({i: [] for i in test_df.index})


test_df
    text                                                keywords
0   I really hate and love love everyone best conf...   unfriendly
1   I really hate and love love everyone best conf...   friendly
2   I really hate and love love everyone best conf...   friendly
3   I really hate and love love everyone best conf...   confident
4   I really hate and love love everyone best conf...   insecure
5   i should be sleeping i have a stressed out wee...   neg_hp
6   late night snack glass of oj bc im quotdown wi...   unfriendly
7   whatever                                            []
8   []                                                  []
9   body of missing northern calif girl found poli      []
10  i miss kenny powers                                 []
11  sorry tell them mea culpa from me and that i ...    sadness
12  sorry tell them mea culpa from me and that i ...    sadness

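Current output without explode

Here the sentence "i miss kenny powers" returns an empty list.

Current output with explode

Here the sentence "i miss kenny powers" returns the emotion confident, which is wrong.

DataFrame: dataframe sample 40k

A workaround using the csv package that currently works for me: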

# New solution: exploding with csv
import ast
import csv

import pandas as pd

CSV_PATH = 'temp_data.csv'
data = []

# Write the frame to disk so each row is read back independently of its index
df_concat.to_csv(CSV_PATH)

with open(file=CSV_PATH, mode='r', newline='') as f:
    reader = csv.DictReader(f)
    columns = reader.fieldnames

    print(columns)

    for record in reader:
        # The list was written as its string form, e.g. "['sadness', 'sadness']";
        # literal_eval parses it without executing arbitrary code
        keywords = ast.literal_eval(record['keywords'])

        # Other columns (category, Valence, Arousal, Dominance) could be appended the same way
        if not keywords:
            # Keep sentences without keywords, marked with the string '[]'
            data.append((record['text'], '[]'))

        for keyword in keywords:
            data.append((record['text'], keyword))

df_concat = pd.DataFrame(data, columns=['text', 'keywords'])
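The same row-by-row expansion can probably be done in memory, without the CSV round trip. The sketch below is only an illustration: explode_by_position is a hypothetical helper name, and it assumes the keywords column already holds Python lists. It pairs each text with its keywords by position, so index labels never come into play, and it uses the same '[]' placeholder as the csv version for sentences without keywords.

```python
import pandas as pd


def explode_by_position(df, text_col="text", list_col="keywords"):
    """Expand a list column into rows, pairing values by position rather than by index label."""
    rows = []
    for text, keywords in zip(df[text_col], df[list_col]):
        if not keywords:
            # Keep the sentence, marked with the same '[]' placeholder as the csv version
            rows.append((text, "[]"))
        for keyword in keywords:
            rows.append((text, keyword))
    return pd.DataFrame(rows, columns=[text_col, list_col])


# Toy data with a repeated index label (hypothetical, not the real frame)
sample = pd.DataFrame(
    {
        "text": ["i hate and love", "i miss kenny powers"],
        "keywords": [["unfriendly", "friendly"], []],
    },
    index=[0, 0],
)
result = explode_by_position(sample)
print(result)
```

Because the result is built from plain tuples, it gets a fresh 0..n-1 index regardless of what the input index looked like.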