How to convert tuples in the rows of a pandas DataFrame column into repeated rows and columns?
I have a dataframe containing the following data (only 3 samples are shown here):
import pandas as pd

data = {'Department': ['D1', 'D2', 'D3'],
        'TopWords': [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
                     [('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
                     [('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}
# Create DataFrame
df = pd.DataFrame(data)
Basically, each row contains a list of tuples holding the top 10 words and their frequencies for one department.
I want to create a dataframe in which the department name is repeated and each row has the word from a tuple in one column and its frequency count in another column, so that it looks like this:
Department Word Counts
D1 cat 6
D1 project 6
D1 dog 6
D1 develop 4
D1 smooth 4
D1 efficient 4
D1 administrative 4
D1 procedure 4
D1 establishment 3
D1 matter 3
D2 management 21
D2 satisfaction 12
D2 within 9
D2 budget 9
D2 township 9
Is there a way to do this kind of transformation?
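I would suggest wrangling the data dictionary before loading it into a dataframe:
import numpy as np
import pandas as pd

# Number of (word, count) tuples per department
length = [len(entry) for entry in data['TopWords']]
# Repeat each department name once per tuple in its list
department = {'Department': np.repeat(data['Department'], length)}
(pd
 .DataFrame([ent for entry in data['TopWords'] for ent in entry],
            columns=['Word', 'Counts'])
 .assign(**department)
)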
Word Counts Department
0 cat 6 D1
1 project 6 D1
2 dog 6 D1
3 develop 4 D1
4 smooth 4 D1
5 efficient 4 D1
6 administrative 4 D1
7 procedure 4 D1
8 establishment 3 D1
9 matter 3 D1
10 management 21 D2
11 satisfaction 12 D2
12 within 9 D2
13 budget 9 D2
14 township 9 D2
15 site 9 D2
16 periodic 9 D2
17 admin 9 D2
18 maintain 9 D2
19 guest 6 D2
20 manage 2 D3
21 ir 2 D3
22 mines 2 D3
23 implimentation 2 D3
24 clrg 2 D3
25 act 2 D3
26 implementations 2 D3
27 office 2 D3
28 maintenance 2 D3
29 administration 2 D3
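For readers unfamiliar with np.repeat, here is a minimal sketch of how it lines the department names up with the flattened tuples. The `data` dictionary below is a shortened, hypothetical stand-in for the one in the question, and the final column selection is only an assumption about how one might reorder the columns to match the requested layout.

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical stand-in for the question's `data`
data = {
    'Department': ['D1', 'D2'],
    'TopWords': [[('cat', 6), ('project', 6)], [('management', 21)]],
}

lengths = [len(entry) for entry in data['TopWords']]  # [2, 1]
# np.repeat repeats each department name by the matching length
repeated = np.repeat(data['Department'], lengths)

# Flatten the nested lists into one list of (word, count) tuples
flat = [pair for entry in data['TopWords'] for pair in entry]
res = (
    pd.DataFrame(flat, columns=['Word', 'Counts'])
    .assign(Department=repeated)
    [['Department', 'Word', 'Counts']]  # reorder to match the requested layout
)
print(res)
```

Because the repeated names and the flattened tuples are built in the same order, row i of one always belongs to row i of the other.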
First, use DataFrame.explode to separate the list elements into different rows. Then split the tuples into different columns, e.g. using DataFrame.assign + Series.str:
res = (
    df.explode('TopWords', ignore_index=True)
      .assign(Word=lambda df: df['TopWords'].str[0],
              Counts=lambda df: df['TopWords'].str[1])
      .drop(columns='TopWords')
)
Output:
>>> res
Department Word Counts
0 D1 cat 6
1 D1 project 6
2 D1 dog 6
3 D1 develop 4
4 D1 smooth 4
5 D1 efficient 4
6 D1 administrative 4
7 D1 procedure 4
8 D1 establishment 3
9 D1 matter 3
10 D2 management 21
11 D2 satisfaction 12
12 D2 within 9
13 D2 budget 9
14 D2 township 9
15 D2 site 9
16 D2 periodic 9
17 D2 admin 9
18 D2 maintain 9
19 D2 guest 6
20 D3 manage 2
21 D3 ir 2
22 D3 mines 2
23 D3 implimentation 2
24 D3 clrg 2
25 D3 act 2
26 D3 implementations 2
27 D3 office 2
28 D3 maintenance 2
29 D3 administration 2
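The .str[0] indexing works here because the accessor can extract elements from tuples and lists, not only from strings. If you would rather not use the string accessor on non-string values, one possible variant (a sketch, not the answer's method) builds both columns at once from the exploded tuples. The `df` below is a shortened, hypothetical stand-in for the one in the question.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the question's `df`
df = pd.DataFrame({
    'Department': ['D1', 'D2'],
    'TopWords': [[('cat', 6), ('project', 6)], [('management', 21)]],
})

# One row per tuple, with a fresh 0..n-1 index
exploded = df.explode('TopWords', ignore_index=True)

# Build both new columns in one step from the list of tuples
pairs = pd.DataFrame(exploded['TopWords'].tolist(),
                     columns=['Word', 'Counts'])

# Both frames share the same default index, so they align row by row
res = pd.concat([exploded[['Department']], pairs], axis=1)
print(res)
```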
As @sammywemmy suggested, if you are dealing with a large amount of data, it will be faster to process it before loading it into a DataFrame.
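A rough way to check this on your own data is to time both approaches with timeit. The sketch below uses synthetic data (2000 hypothetical departments with 10 tuples each); the function names via_explode and via_dict are made up for illustration, and the timings will vary with machine, pandas version and data shape.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic, hypothetical data: 2000 departments with 10 (word, count) pairs each
data = {
    'Department': [f'D{i}' for i in range(2000)],
    'TopWords': [[(f'w{j}', j) for j in range(10)] for _ in range(2000)],
}
df = pd.DataFrame(data)

def via_explode():
    # Reshape after the data is already in a DataFrame
    return (
        df.explode('TopWords', ignore_index=True)
          .assign(Word=lambda d: d['TopWords'].str[0],
                  Counts=lambda d: d['TopWords'].str[1])
          .drop(columns='TopWords')
    )

def via_dict():
    # Reshape the plain Python structures first, then build the DataFrame
    lengths = [len(entry) for entry in data['TopWords']]
    flat = [pair for entry in data['TopWords'] for pair in entry]
    return (
        pd.DataFrame(flat, columns=['Word', 'Counts'])
          .assign(Department=np.repeat(data['Department'], lengths))
          [['Department', 'Word', 'Counts']]
    )

# Total seconds for 5 runs of each approach
print('explode:', timeit.timeit(via_explode, number=5))
print('dict   :', timeit.timeit(via_dict, number=5))
```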
Another approach, using nested loops:
import pandas as pd

data = {'Department': ['D1', 'D2', 'D3'],
        'TopWords': [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
                     [('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
                     [('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}
records = []
for idx, top_words_list in enumerate(data['TopWords']):
    for word, count in top_words_list:
        # One record per (department, word, count) combination
        rec = {
            'Department': data['Department'][idx],
            'Word': word,
            'Count': count
        }
        records.append(rec)
res = pd.DataFrame(records)
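The same nested iteration can be written more compactly. The sketch below is one possible variant rather than the answerer's code: it uses zip to pair each department with its list, so no index lookup is needed, and a list comprehension to build the records. The `data` dictionary is a shortened, hypothetical stand-in for the one above.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the `data` dictionary above
data = {
    'Department': ['D1', 'D2'],
    'TopWords': [[('cat', 6), ('project', 6)], [('management', 21)]],
}

# zip pairs each department with its own list of (word, count) tuples
records = [
    {'Department': department, 'Word': word, 'Count': count}
    for department, top_words in zip(data['Department'], data['TopWords'])
    for word, count in top_words
]
res = pd.DataFrame(records)
print(res)
```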
An option using a dictionary comprehension:
(df
 .drop(columns='TopWords')
 .join(pd.concat({k: pd.DataFrame(x, columns=['Word', 'Counts'])
                  for k, x in enumerate(df['TopWords'])}).droplevel(1))
)
Output:
Department Word Counts
0 D1 cat 6
0 D1 project 6
0 D1 dog 6
0 D1 develop 4
0 D1 smooth 4
0 D1 efficient 4
0 D1 administrative 4
0 D1 procedure 4
0 D1 establishment 3
0 D1 matter 3
1 D2 management 21
1 D2 satisfaction 12
1 D2 within 9
1 D2 budget 9
1 D2 township 9
1 D2 site 9
1 D2 periodic 9
1 D2 admin 9
1 D2 maintain 9
1 D2 guest 6
2 D3 manage 2
2 D3 ir 2
2 D3 mines 2
2 D3 implimentation 2
2 D3 clrg 2
2 D3 act 2
2 D3 implementations 2
2 D3 office 2
2 D3 maintenance 2
2 D3 administration 2
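Note that the index labels in this output repeat (0, 0, ..., 1, 1, ...) because join keeps the row labels of the original df. If a fresh 0..n-1 index is wanted, a reset_index(drop=True) can be appended, as in this sketch. The `df` below is a shortened, hypothetical stand-in, and the enumerate keys only line up with the row labels because df has the default integer index.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the question's `df`
df = pd.DataFrame({
    'Department': ['D1', 'D2'],
    'TopWords': [[('cat', 6), ('project', 6)], [('management', 21)]],
})

# The dict keys become the outer index level; dropping the inner level
# leaves labels that match the rows of df (assumes a default RangeIndex)
pieces = pd.concat({k: pd.DataFrame(x, columns=['Word', 'Counts'])
                    for k, x in enumerate(df['TopWords'])}).droplevel(1)

res = (
    df.drop(columns='TopWords')
      .join(pieces)
      .reset_index(drop=True)  # replace the repeated labels with 0, 1, 2
)
print(res)
```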
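In addition to @sammywemmy's answer, the following approach does not need the numpy package, but because of the double loop it may not perform well on large datasets.
# Build one list per output column
d = {"Department": [], "Words": [], "Count": []}
for idx, department in enumerate(data["Department"]):
    for word, count in data["TopWords"][idx]:
        d["Department"].append(department)
        d["Words"].append(word)
        d["Count"].append(count)
print(pd.DataFrame(d))
@Rodalm Using enumerate makes this code more readable. Previously I had used a plain range().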