How to convert tuples in the rows of a pandas DataFrame column into repeated rows and columns?

I have a dataframe with the following data (only 3 samples are shown here):

import pandas as pd

data = {'Department' : ['D1', 'D2', 'D3'],
'TopWords' : [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
[('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
[('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}


# Create DataFrame
df = pd.DataFrame(data)

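Basically, each row holds the tuples of the top 10 words and their frequencies for one department.

I want to create a dataframe where the department name is repeated and each row has the word from a tuple in one column and its frequency count in another, so that it looks like this: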

Department  Word            Counts
D1          cat             6
D1          project         6
D1          dog             6
D1          develop         4
D1          smooth          4
D1          efficient       4
D1          administrative  4
D1          procedure       4
D1          establishment   3
D1          matter          3
D2          management      21
D2          satisfaction    12
D2          within          9
D2          budget          9
D2          township        9

Is there a way to do this kind of transformation?

I would suggest you wrangle the data dictionary before loading it into a dataframe:

import numpy as np
import pandas as pd

# one length per department, so each name is repeated once per tuple
length = [len(entry) for entry in data['TopWords']]
department = {'Department' : np.repeat(data['Department'], length)}
(pd
.DataFrame([ent for entry in data['TopWords'] for ent in entry],
            columns = ['Word', 'Counts'])
.assign(**department)
)

               Word  Counts Department
0               cat       6         D1
1           project       6         D1
2               dog       6         D1
3           develop       4         D1
4            smooth       4         D1
5         efficient       4         D1
6    administrative       4         D1
7         procedure       4         D1
8     establishment       3         D1
9            matter       3         D1
10       management      21         D2
11     satisfaction      12         D2
12           within       9         D2
13           budget       9         D2
14         township       9         D2
15             site       9         D2
16         periodic       9         D2
17            admin       9         D2
18         maintain       9         D2
19            guest       6         D2
20           manage       2         D3
21               ir       2         D3
22            mines       2         D3
23   implimentation       2         D3
24             clrg       2         D3
25              act       2         D3
26  implementations       2         D3
27           office       2         D3
28      maintenance       2         D3
29   administration       2         D3
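
The key step in the code above is np.repeat, which repeats each department name once per tuple in its list. A minimal sketch of that step alone, using made-up lengths (the `lengths` values here are hypothetical, not taken from the question's data):

```python
import numpy as np

departments = ["D1", "D2", "D3"]
lengths = [2, 1, 3]  # hypothetical: number of tuples per department

# Element i of `departments` is repeated lengths[i] times, in order.
repeated = np.repeat(departments, lengths)
print(repeated.tolist())
# ['D1', 'D1', 'D2', 'D3', 'D3', 'D3']
```

Because the flattened list of tuples is built in the same department order, the repeated names line up row for row with the Word and Counts columns.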

First, use DataFrame.explode to separate the list elements into different rows. Then split the tuples into different columns, e.g. using DataFrame.assign + Series.str

res = (
    df.explode('TopWords', ignore_index=True)
      .assign(Word=lambda df: df['TopWords'].str[0], 
              Counts=lambda df: df['TopWords'].str[1])
      .drop(columns='TopWords')
)  

Output:

>>> res 

   Department             Word  Counts
0          D1              cat       6
1          D1          project       6
2          D1              dog       6
3          D1          develop       4
4          D1           smooth       4
5          D1        efficient       4
6          D1   administrative       4
7          D1        procedure       4
8          D1    establishment       3
9          D1           matter       3
10         D2       management      21
11         D2     satisfaction      12
12         D2           within       9
13         D2           budget       9
14         D2         township       9
15         D2             site       9
16         D2         periodic       9
17         D2            admin       9
18         D2         maintain       9
19         D2            guest       6
20         D3           manage       2
21         D3               ir       2
22         D3            mines       2
23         D3   implimentation       2
24         D3             clrg       2
25         D3              act       2
26         D3  implementations       2
27         D3           office       2
28         D3      maintenance       2
29         D3   administration       2
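
To see the two steps separately, here is a small sketch on a two-row frame (the sample values are made up for illustration). It also shows one possible alternative for the split step, building the Word and Counts columns from the exploded tuples with Series.tolist instead of Series.str; this is a variation, not part of the answer above.

```python
import pandas as pd

# Tiny, made-up frame with the same shape as the question's data.
df = pd.DataFrame({
    "Department": ["D1", "D2"],
    "TopWords": [[("cat", 6), ("dog", 4)], [("budget", 9)]],
})

# Step 1: one row per list element; each TopWords cell is now a single tuple.
exploded = df.explode("TopWords", ignore_index=True)

# Step 2 (alternative to .str[0] / .str[1]): unpack the tuples into two columns.
pairs = pd.DataFrame(exploded["TopWords"].tolist(), columns=["Word", "Counts"])

# Both frames share the same 0..n-1 index, so a column-wise concat lines up.
res = pd.concat([exploded[["Department"]], pairs], axis=1)
print(res)
```

Note that ignore_index=True is what gives the exploded frame a fresh 0..n-1 index; without it, the column-wise concat here would not align as intended.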

As @sammywemmy suggested, if you are dealing with a large amount of data, it will be faster to process it before loading it into a DataFrame.
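
If you want to check the speed claim on your own machine, a rough sketch like the following could be used. The synthetic data, its size, and the helper names flatten_before_loading and explode_after_loading are all assumptions made for this illustration; actual timings depend on the pandas version and the data.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (hypothetical size): 2,000 departments, 10 tuples each.
n_departments, n_words = 2_000, 10
data = {
    "Department": [f"D{i}" for i in range(n_departments)],
    "TopWords": [[(f"w{j}", j) for j in range(n_words)]
                 for _ in range(n_departments)],
}

def flatten_before_loading(data):
    """Flatten the raw dict first, then build the DataFrame once."""
    lengths = [len(entry) for entry in data["TopWords"]]
    flat = [pair for entry in data["TopWords"] for pair in entry]
    out = pd.DataFrame(flat, columns=["Word", "Counts"])
    out.insert(0, "Department", np.repeat(data["Department"], lengths))
    return out

def explode_after_loading(data):
    """Load the nested lists into a DataFrame, then explode and split."""
    df = pd.DataFrame(data)
    return (
        df.explode("TopWords", ignore_index=True)
          .assign(Word=lambda d: d["TopWords"].str[0],
                  Counts=lambda d: d["TopWords"].str[1])
          .drop(columns="TopWords")
    )

a = flatten_before_loading(data)
b = explode_after_loading(data)

# Both approaches should produce the same rows in the same order.
same = a.values.tolist() == b.values.tolist()
print("same rows:", same)

# Time each approach a few times; no particular result is assumed here.
for fn in (flatten_before_loading, explode_after_loading):
    seconds = timeit.timeit(lambda: fn(data), number=3)
    print(f"{fn.__name__}: {seconds:.3f}s for 3 runs")
```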

Another approach, using nested loops:

data = {'Department' : ['D1', 'D2', 'D3'],
'TopWords' : [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
[('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
[('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}

records = []
for idx, top_words_list in enumerate(data['TopWords']):
    for word, count in top_words_list:
        rec = {
            'Department': data['Department'][idx],
            'Word': word,
            'Counts': count  # column name as in the desired output
        }
        records.append(rec)
        
res = pd.DataFrame(records)

An option using a dictionary comprehension:

(df
 .drop(columns='TopWords')
 .join(pd.concat({k: pd.DataFrame(x, columns=['Word', 'Counts'])
                  for k,x in enumerate(df['TopWords'])}).droplevel(1))

)

Output:

  Department             Word  Counts
0         D1              cat       6
0         D1          project       6
0         D1              dog       6
0         D1          develop       4
0         D1           smooth       4
0         D1        efficient       4
0         D1   administrative       4
0         D1        procedure       4
0         D1    establishment       3
0         D1           matter       3
1         D2       management      21
1         D2     satisfaction      12
1         D2           within       9
1         D2           budget       9
1         D2         township       9
1         D2             site       9
1         D2         periodic       9
1         D2            admin       9
1         D2         maintain       9
1         D2            guest       6
2         D3           manage       2
2         D3               ir       2
2         D3            mines       2
2         D3   implimentation       2
2         D3             clrg       2
2         D3              act       2
2         D3  implementations       2
2         D3           office       2
2         D3      maintenance       2
2         D3   administration       2
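
Note that the joined result keeps the original row labels (0, 0, ..., 1, ..., 2), as the output above shows. If a fresh 0..n-1 index is preferred, one option is to add reset_index(drop=True) at the end. A small sketch on made-up data:

```python
import pandas as pd

# Tiny, made-up frame with the same shape as the question's data.
df = pd.DataFrame({
    "Department": ["D1", "D2"],
    "TopWords": [[("cat", 6), ("dog", 4)], [("budget", 9)]],
})

# One small frame per original row, keyed by the row position.
per_row = pd.concat({k: pd.DataFrame(x, columns=["Word", "Counts"])
                     for k, x in enumerate(df["TopWords"])})

res = (
    df.drop(columns="TopWords")
      .join(per_row.droplevel(1))   # index is now 0, 0, 1
      .reset_index(drop=True)       # renumber rows as 0, 1, 2
)
print(res)
```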

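In addition to @sammywemmy's answer, the following approach does not need the numpy package; however, because of the double loop, it may not perform well on large datasets.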

# column names follow the desired output in the question: Word, Counts
d = {"Department": [], "Word": [], "Counts": []}
for idx, department in enumerate(data["Department"]):
    for word, count in data["TopWords"][idx]:
        d["Department"].append(department)
        d["Word"].append(word)
        d["Counts"].append(count)

print(pd.DataFrame(d))

@Rodalm Using enumerate makes this code more readable. Previously I used a plain range().
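
As a further variation on the same idea (not from the answers above), the index lookup can be avoided altogether by pairing departments with their word lists using zip. A minimal sketch on made-up data:

```python
import pandas as pd

# Tiny, made-up dict with the same shape as the question's data.
data = {
    "Department": ["D1", "D2"],
    "TopWords": [[("cat", 6), ("dog", 4)], [("budget", 9)]],
}

# zip pairs each department with its own list of (word, count) tuples.
rows = [
    (dept, word, count)
    for dept, words in zip(data["Department"], data["TopWords"])
    for word, count in words
]

res = pd.DataFrame(rows, columns=["Department", "Word", "Counts"])
print(res)
```

This still loops in pure Python, so the same caveat about large datasets applies.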