pandas explode producing unexpected results
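I am trying to explode one column of a DataFrame into multiple rows. The column to explode is called keywords; it holds the list of emotions returned by keyword extraction with the FlashText package. That is, if a keyword appears in the text column (the column holding the sentences), the extraction returns the emotion or emotions associated with that sentence.
With the sample DataFrame I created below, this works and gives the expected output, but when I apply explode to my real DataFrame it returns what look like random combinations of rows.
I thought this unexpected result was caused by duplicate index values in the DataFrame; however, removing them gives the same wrong result.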
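To make the duplicate-index suspicion concrete, here is a minimal sketch of how it could be checked and ruled out. The frame below is hypothetical (it is not the real data), and the assumption is that the repeated labels come from something like pd.concat. The point it illustrates is the ordering: the index is reset before explode, not only afterwards.

```python
import pandas as pd

# Hypothetical frame with a repeated index label, as pd.concat can produce
df = pd.DataFrame(
    {"text": ["a b", "c", "d"], "keywords": [["x", "y"], [], ["z"]]},
    index=[0, 0, 1],
)

# True when at least one index label occurs more than once
has_duplicates = not df.index.is_unique

# Give every source row its own label BEFORE exploding
clean = df.reset_index(drop=True)

# Explode, then renumber the rows of the longer result
exploded = clean.explode("keywords").reset_index(drop=True)
```

If has_duplicates is True on the real frame, comparing the result of exploding clean against exploding df directly should show whether the index is involved at all.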
Expected output:
import pandas as pd
from flashtext import KeywordProcessor

# keywords_dict (defined elsewhere, not shown) maps each emotion to its list of keywords
kp = KeywordProcessor()
kp.add_keywords_from_dict(keyword_dict=keywords_dict)

test_df = pd.DataFrame({'text': [
    'I really hate and love love everyone best confident shy',
    'i should be sleeping i have a stressed out week coming to me',
    'late night snack glass of oj bc im quotdown with the sicknessquot then back to sleepugh i hate getting sick',
    # NaN results to empty list
    'whatever',
    '[]',
    'body of missing northern calif girl found poli',
    'i miss kenny powers',
    'sorry tell them mea culpa from me and that i really am sorry',
]})

# Extracting keywords
test_df['keywords'] = test_df['text'].apply(lambda x: kp.extract_keywords(x, span_info=False))

# Exploding keywords column into rows, then dropping the duplicated index labels
test_df = test_df.explode('keywords').reset_index(drop=True)

# Transforming NaN into empty list
test_df['keywords'] = test_df['keywords'].fillna({i: [] for i in test_df.index})
test_df
    text                                                keywords
0   I really hate and love love everyone best conf...  unfriendly
1   I really hate and love love everyone best conf...  friendly
2   I really hate and love love everyone best conf...  friendly
3   I really hate and love love everyone best conf...  confident
4   I really hate and love love everyone best conf...  insecure
5   i should be sleeping i have a stressed out wee...  neg_hp
6   late night snack glass of oj bc im quotdown wi...  unfriendly
7   whatever                                            []
8   []                                                  []
9   body of missing northern calif girl found poli     []
10  i miss kenny powers                                 []
11  sorry tell them mea culpa from me and that i ...   sadness
12  sorry tell them mea culpa from me and that i ...   sadness
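Current output without explode:
Here the sentence "i miss kenny powers" returns an empty list.
Current output with explode:
Here the sentence "i miss kenny powers" returns the emotion confident, which is wrong.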
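A mismatch like this can be found programmatically rather than by eye. The following is only a sketch under assumptions: the two-row frame is made up for illustration, and it assumes each sentence appears once in the text column. It rebuilds the keyword list per sentence from the exploded frame and compares it with the list stored before exploding.

```python
import pandas as pd

# Hypothetical stand-in for the frame before exploding
before = pd.DataFrame({
    "text": ["i miss kenny powers", "sorry sorry"],
    "keywords": [[], ["sadness", "sadness"]],
})
after = before.explode("keywords").reset_index(drop=True)

# Keyword list per sentence as it was before exploding
expected = dict(zip(before["text"], before["keywords"]))

# Keyword list per sentence rebuilt from the exploded rows;
# rows that came from an empty list hold NaN and are dropped first
actual = (
    after.dropna(subset=["keywords"])
         .groupby("text")["keywords"]
         .agg(list)
)

# Sentences whose rebuilt list differs from the original one
mismatches = [
    text for text, kws in expected.items()
    if actual.get(text, []) != list(kws)
]
```

On a correct explode, mismatches is empty; on the real data, any sentence listed there (such as "i miss kenny powers") picked up keywords that belong to another row.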
The workaround that currently works for me, using the csv package:
# New solution: exploding with csv
import ast
import csv

import pandas as pd

CSV_PATH = 'temp_data.csv'
data = []
df_concat.to_csv(CSV_PATH)
with open(file=CSV_PATH, mode='r', newline='') as f:
    reader = csv.DictReader(f)
    columns = reader.fieldnames
    print(columns)
    for record in reader:
        # Each list was written to the CSV as its string repr; parse it back
        # with ast.literal_eval instead of eval so no arbitrary code can run
        keywords = ast.literal_eval(record['keywords'])
        if not keywords:
            # Keep sentences with no match, using '[]' as a placeholder
            data.append((record['text'], '[]'))  # record['category'], record['Valence'], record['Arousal'], record['Dominance']
        for keyword in keywords:
            data.append((record['text'], keyword))  # record['category'], record['Valence'], record['Arousal'], record['Dominance']
df_concat = pd.DataFrame(data, columns=['text', 'keywords'])
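The CSV round trip works because it pairs each sentence with its keywords by position in the file rather than by index label. The same idea can be sketched in memory, without a temporary file or string parsing. This is an illustration and not a drop-in replacement: the helper name explode_in_memory and the sample frame are hypothetical, and it assumes every cell of the keywords column is a real Python list.

```python
import pandas as pd

def explode_in_memory(df, text_col="text", list_col="keywords", empty="[]"):
    """Return one row per list item, pairing values by position, not by index label."""
    rows = []
    for text, items in zip(df[text_col], df[list_col]):
        if not items:
            # Keep sentences with no keywords, using the same '[]' placeholder
            rows.append((text, empty))
        for item in items:
            rows.append((text, item))
    return pd.DataFrame(rows, columns=[text_col, list_col])

# Hypothetical sample with a repeated index label
sample = pd.DataFrame(
    {
        "text": ["i miss kenny powers", "best confident shy"],
        "keywords": [[], ["confident", "insecure"]],
    },
    index=[0, 0],
)
result = explode_in_memory(sample)
```

Because zip walks the two columns in row order, repeated index labels in the input cannot mix keywords between sentences, and the result gets a fresh 0..n-1 index.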