如何使用 pandas 合并多个数据框并为假人添加一列?

How do I merge multiple dataframes and add a column for dummies using pandas?

我对合并多个数据框和添加一列虚拟对象有疑问。

现在我有两个原始输入数据帧。第一个数据框正在回答“你最喜欢哪种颜色?”的问题。第二个数据框正在回答“从 1 到 7 的等级,你在多大程度上不喜欢这种颜色?”

df1 = pd.DataFrame({'id': ['01','02'],
                    'like_wave_1': ['red','red'],
                    'like_wave_2': ['red','yellow']})
print(df1)

df2 = pd.DataFrame({'id': ['01','02'],
                    'dislike_wave1_yellow': ['7','2'],
                    'dislike_wave1_red':['1','1'],
                    'dislike_wave1_blue':['2','7'],
                    'dislike_wave2_yellow': ['7','1'],
                    'dislike_wave2_red':['1','2'],
                    'dislike_wave2_blue':['3','7']})
print(df2)

以下数据框构成了我预期的输出数据框的轮廓。

list_id = ['01','02']
list_color = ['yellow','red','blue']
list_wave = ['1','2']
expand = list(product(list_id, list_color, list_wave))
df = pd.DataFrame.from_records(expand, columns=['id', 'color', 'wave'])
print(df)
    id   color wave
0   01  yellow    1
1   01  yellow    2
2   01     red    1
3   01     red    2
4   01    blue    1
5   01    blue    2
6   02  yellow    1
7   02  yellow    2
8   02     red    1
9   02     red    2
10  02    blue    1
11  02    blue    2

我想向 df 添加两列:

(1) "like": 显示是否在特定 wave 中为特定 id 显示颜色的列(1 表示是,0 表示否)

(2) "不喜欢".

因此我预期的数据框是:

    id   color wave  like  dislike
0   01  yellow    1     0        7
1   01  yellow    2     0        7
2   01     red    1     1        1
3   01     red    2     1        1
4   01    blue    1     0        2
5   01    blue    2     0        3
6   02  yellow    1     0        2
7   02  yellow    2     1        1
8   02     red    1     1        1
9   02     red    2     0        2
10  02    blue    1     0        7
11  02    blue    2     0        7

你能帮我解决这个问题吗?非常感谢您的回答!

尝试将两个框架转换为与另一个框架兼容的格式:

DF1

# Get df1 into usable format
df1 = df1.set_index('id')
# Create Multi Index by splitting columns on '_'
df1.columns = df1.columns.str.split('_', expand=True)
# Stack to create long format frame
df1 = df1.stack().reset_index()
# Fix column names to match df2/output
df1.columns = ['id', 'wave', 'color']
# Set like to 1 for these since this table indicates likes
df1['like'] = 1

df1:

   id wave   color  like
0  01    1     red     1
1  01    2     red     1
2  02    1     red     1
3  02    2  yellow     1

DF2

# Get df2 into usable format
# Set index to ID
df2 = df2.set_index('id')
# Create Multi Index by splitting columns on '_'
df2.columns = df2.columns.str.split('_', expand=True)
# Stack to create long format frame
df2 = df2.stack(level=[1, 2]).reset_index()
# Fix column names to match df1
df2.columns = ['id', 'wave', 'color', 'dislike']
# Turn "wave1" into 1, "wave2" into 2, ... etc.
df2['wave'] = df2['wave'].str.lstrip('wave')

df2:

    id wave   color dislike
0   01    1    blue       2
1   01    1     red       1
2   01    1  yellow       7
3   01    2    blue       3
4   01    2     red       1
5   01    2  yellow       7
6   02    1    blue       7
7   02    1     red       1
8   02    1  yellow       2
9   02    2    blue       7
10  02    2     red       2
11  02    2  yellow       1

然后merge帧在一起:

# Merge On Common Columns
df3 = df1.merge(df2, on=['id', 'wave', 'color'], how='outer')

# Fill empty values in like and dislike with 0 (only 1s in source DF1)
# (Fill dislikes in case there are likes in df1 that are not dislikes in df2)
df3[['like', 'dislike']] = df3[['like', 'dislike']].fillna(0).astype(int)

# Sort Values and fix index (to match output in question)
df3 = df3.sort_values(
    ['id', 'color'], ascending=[True, False]
).reset_index(drop=True)

df3:

    id wave   color  like dislike
0   01    1  yellow     0       7
1   01    2  yellow     0       7
2   01    1     red     1       1
3   01    2     red     1       1
4   01    1    blue     0       2
5   01    2    blue     0       3
6   02    1  yellow     0       2
7   02    2  yellow     1       1
8   02    1     red     1       1
9   02    2     red     0       2
10  02    1    blue     0       7
11  02    2    blue     0       7

我们可以使用 pivot_longer from pyjanitor 在合并之前重塑各个数据帧:

left = (df1.pivot_longer('id', 
                         names_to=('.value', 'num'), 
                         names_pattern=r".+_(.+)_(\d$)")
           .rename(columns={"wave":"color",
                            "num":"wave"})
           .assign(like = 1)
         )

left
 
   id wave   color  like
0  01    1     red     1
1  02    1     red     1
2  01    2     red     1
3  02    2  yellow     1


right = (df2.pivot_longer('id',
                          names_to=(".value", "dislike", "color"), 
                          names_pattern = r".+_(.+)(\d)_(.+)", 
                          sort_by_appearance=True)
           .rename(columns = {"dislike":"wave", "wave":"dislike"})
          )

right
 
    id wave   color dislike
0   01    1  yellow       7
1   01    1     red       1
2   01    1    blue       2
3   01    2  yellow       7
4   01    2     red       1
5   01    2    blue       3
6   02    1  yellow       2
7   02    1     red       1
8   02    1    blue       7
9   02    2  yellow       1
10  02    2     red       2
11  02    2    blue       7

right.merge(left, how = 'outer').fillna(0)

    id wave   color dislike  like
0   01    1  yellow       7   0.0
1   01    1     red       1   1.0
2   01    1    blue       2   0.0
3   01    2  yellow       7   0.0
4   01    2     red       1   1.0
5   01    2    blue       3   0.0
6   02    1  yellow       2   0.0
7   02    1     red       1   1.0
8   02    1    blue       7   0.0
9   02    2  yellow       1   1.0
10  02    2     red       2   0.0
11  02    2    blue       7   0.0