How do I merge multiple dataframes and add a column for dummies using pandas?
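I have a question about merging multiple dataframes and adding a column of dummy variables.
I currently have two raw input dataframes. The first dataframe holds the answers to the question "Which color do you like most?" The second dataframe holds the answers to the question "On a scale from 1 to 7, how much do you dislike this color?"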
import pandas as pd

df1 = pd.DataFrame({'id': ['01', '02'],
                    'like_wave_1': ['red', 'red'],
                    'like_wave_2': ['red', 'yellow']})
print(df1)
df2 = pd.DataFrame({'id': ['01', '02'],
                    'dislike_wave1_yellow': ['7', '2'],
                    'dislike_wave1_red': ['1', '1'],
                    'dislike_wave1_blue': ['2', '7'],
                    'dislike_wave2_yellow': ['7', '1'],
                    'dislike_wave2_red': ['1', '2'],
                    'dislike_wave2_blue': ['3', '7']})
print(df2)
The following dataframe forms the skeleton of my expected output dataframe.
from itertools import product

list_id = ['01', '02']
list_color = ['yellow', 'red', 'blue']
list_wave = ['1', '2']
# Every (id, color, wave) combination, in the order the lists are given
expand = list(product(list_id, list_color, list_wave))
df = pd.DataFrame.from_records(expand, columns=['id', 'color', 'wave'])
print(df)
id color wave
0 01 yellow 1
1 01 yellow 2
2 01 red 1
3 01 red 2
4 01 blue 1
5 01 blue 2
6 02 yellow 1
7 02 yellow 2
8 02 red 1
9 02 red 2
10 02 blue 1
11 02 blue 2
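As an aside, the same skeleton can probably be built without itertools. This is only a minimal sketch, assuming the three lists from the question; the name skeleton is illustrative and does not appear in the original post.

```python
import pandas as pd

list_id = ['01', '02']
list_color = ['yellow', 'red', 'blue']
list_wave = ['1', '2']

# Cartesian product of the three lists, kept in the order given
skeleton = pd.MultiIndex.from_product(
    [list_id, list_color, list_wave],
    names=['id', 'color', 'wave'],
).to_frame(index=False)

print(skeleton)
```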
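I want to add two columns to df:
(1) "like": a column showing whether a specific id picked the color as the favorite in a specific wave (1 for yes, 0 for no).
(2) "dislike": the 1-to-7 dislike rating for that id, color, and wave.
So my expected dataframe is: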
id color wave like dislike
0 01 yellow 1 0 7
1 01 yellow 2 0 7
2 01 red 1 1 1
3 01 red 2 1 1
4 01 blue 1 0 2
5 01 blue 2 0 3
6 02 yellow 1 0 2
7 02 yellow 2 1 1
8 02 red 1 1 1
9 02 red 2 0 2
10 02 blue 1 0 7
11 02 blue 2 0 7
Could you help me solve this? Thank you very much for your answers!
Try converting both frames into a format in which they are compatible with each other:
DF1
# Get df1 into usable format
df1 = df1.set_index('id')
# Create Multi Index by splitting columns on '_'
df1.columns = df1.columns.str.split('_', expand=True)
# Stack to create long format frame
df1 = df1.stack().reset_index()
# Fix column names to match df2/output
df1.columns = ['id', 'wave', 'color']
# Set like to 1 for these since this table indicates likes
df1['like'] = 1
df1:
id wave color like
0 01 1 red 1
1 01 2 red 1
2 02 1 red 1
3 02 2 yellow 1
DF2
# Get df2 into usable format
# Set index to ID
df2 = df2.set_index('id')
# Create Multi Index by splitting columns on '_'
df2.columns = df2.columns.str.split('_', expand=True)
# Stack to create long format frame
df2 = df2.stack(level=[1, 2]).reset_index()
# Fix column names to match df1
df2.columns = ['id', 'wave', 'color', 'dislike']
# Turn "wave1" into 1, "wave2" into 2, ... etc.
# (remove the literal 'wave' prefix rather than a set of characters)
df2['wave'] = df2['wave'].str.replace('wave', '', regex=False)
df2:
id wave color dislike
0 01 1 blue 2
1 01 1 red 1
2 01 1 yellow 7
3 01 2 blue 3
4 01 2 red 1
5 01 2 yellow 7
6 02 1 blue 7
7 02 1 red 1
8 02 1 yellow 2
9 02 2 blue 7
10 02 2 red 2
11 02 2 yellow 1
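When turning labels such as "wave1" into "1", it helps to remember that str.lstrip takes a set of characters, not a prefix. For the labels in this question both approaches give the same result, but they can diverge on other labels. The sketch below uses a made-up label, 'eve3', purely to illustrate the difference.

```python
import pandas as pd

# 'eve3' is a hypothetical label, not from the question's data
labels = pd.Series(['wave1', 'wave2', 'eve3'])

# lstrip removes any leading run of the characters w, a, v, e
stripped = labels.str.lstrip('wave')

# replace with regex=False removes only the literal substring 'wave'
replaced = labels.str.replace('wave', '', regex=False)

print(stripped.tolist())  # ['1', '2', '3']
print(replaced.tolist())  # ['1', '2', 'eve3']
```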
Then merge the frames together:
# Merge On Common Columns
df3 = df1.merge(df2, on=['id', 'wave', 'color'], how='outer')
# Fill empty values in like and dislike with 0 (only 1s in source DF1)
# (Fill dislikes in case there are likes in df1 that are not dislikes in df2)
df3[['like', 'dislike']] = df3[['like', 'dislike']].fillna(0).astype(int)
# Sort Values and fix index (to match output in question)
df3 = df3.sort_values(
    ['id', 'color'], ascending=[True, False]
).reset_index(drop=True)
df3:
id wave color like dislike
0 01 1 yellow 0 7
1 01 2 yellow 0 7
2 01 1 red 1 1
3 01 2 red 1 1
4 01 1 blue 0 2
5 01 2 blue 0 3
6 02 1 yellow 0 2
7 02 2 yellow 1 1
8 02 1 red 1 1
9 02 2 red 0 2
10 02 1 blue 0 7
11 02 2 blue 0 7
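For comparison, here is a self-contained sketch of the same idea that swaps in melt plus str.extract instead of the MultiIndex/stack approach, and left-merges onto the skeleton from the question so that the row order matches the expected output directly. The names skeleton, likes, dislikes, key, and result are illustrative assumptions, not part of the original answer.

```python
from itertools import product

import pandas as pd

df1 = pd.DataFrame({'id': ['01', '02'],
                    'like_wave_1': ['red', 'red'],
                    'like_wave_2': ['red', 'yellow']})
df2 = pd.DataFrame({'id': ['01', '02'],
                    'dislike_wave1_yellow': ['7', '2'],
                    'dislike_wave1_red': ['1', '1'],
                    'dislike_wave1_blue': ['2', '7'],
                    'dislike_wave2_yellow': ['7', '1'],
                    'dislike_wave2_red': ['1', '2'],
                    'dislike_wave2_blue': ['3', '7']})

# Skeleton from the question: every (id, color, wave) combination
skeleton = pd.DataFrame(
    list(product(['01', '02'], ['yellow', 'red', 'blue'], ['1', '2'])),
    columns=['id', 'color', 'wave'],
)

# Likes: one row per (id, wave); the cell value is the liked color
likes = df1.melt(id_vars='id', var_name='wave', value_name='color')
likes['wave'] = likes['wave'].str.extract(r'(\d+)$', expand=False)
likes['like'] = 1

# Dislikes: one row per (id, wave, color); the cell value is the rating
dislikes = df2.melt(id_vars='id', var_name='key', value_name='dislike')
dislikes[['wave', 'color']] = dislikes['key'].str.extract(r'wave(\d+)_(.+)$')
dislikes = dislikes.drop(columns='key')

# Left merges keep the skeleton's row order
result = (
    skeleton
    .merge(likes, on=['id', 'wave', 'color'], how='left')
    .merge(dislikes, on=['id', 'wave', 'color'], how='left')
)
# Combinations absent from likes were not picked, so they become 0
result['like'] = result['like'].fillna(0).astype(int)
result['dislike'] = result['dislike'].astype(int)

print(result)
```

Merging onto the skeleton with how='left' avoids the final sort, at the cost of silently dropping any rows in df1 or df2 whose id, color, or wave is not in the skeleton.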
We can use pivot_longer from pyjanitor to reshape the individual dataframes before merging:
# pip install pyjanitor
# Importing janitor registers pivot_longer on DataFrame
import janitor

left = (df1.pivot_longer('id',
                         names_to=('.value', 'num'),
                         names_pattern=r".+_(.+)_(\d$)")
           .rename(columns={"wave": "color",
                            "num": "wave"})
           .assign(like=1)
        )
left
id wave color like
0 01 1 red 1
1 02 1 red 1
2 01 2 red 1
3 02 2 yellow 1
right = (df2.pivot_longer('id',
                          names_to=(".value", "dislike", "color"),
                          names_pattern=r".+_(.+)(\d)_(.+)",
                          sort_by_appearance=True)
            .rename(columns={"dislike": "wave", "wave": "dislike"})
         )
right
id wave color dislike
0 01 1 yellow 7
1 01 1 red 1
2 01 1 blue 2
3 01 2 yellow 7
4 01 2 red 1
5 01 2 blue 3
6 02 1 yellow 2
7 02 1 red 1
8 02 1 blue 7
9 02 2 yellow 1
10 02 2 red 2
11 02 2 blue 7
right.merge(left, how = 'outer').fillna(0)
id wave color dislike like
0 01 1 yellow 7 0.0
1 01 1 red 1 1.0
2 01 1 blue 2 0.0
3 01 2 yellow 7 0.0
4 01 2 red 1 1.0
5 01 2 blue 3 0.0
6 02 1 yellow 2 0.0
7 02 1 red 1 1.0
8 02 1 blue 7 0.0
9 02 2 yellow 1 1.0
10 02 2 red 2 0.0
11 02 2 blue 7 0.0
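In the merged frame above, like comes out as a float (0.0/1.0), and dislike presumably still holds the strings from df2. A minimal sketch of tidying the dtypes and the column order follows. It assumes left and right have the long shapes shown above; small stand-ins are built by hand here (only the rows for id 01, wave 1) so the snippet runs without pyjanitor. The names merged and tidy are illustrative.

```python
import pandas as pd

# Hand-built stand-ins for the reshaped frames (a subset of the rows above)
right = pd.DataFrame({
    'id': ['01', '01', '01'],
    'wave': ['1', '1', '1'],
    'color': ['yellow', 'red', 'blue'],
    'dislike': ['7', '1', '2'],
})
left = pd.DataFrame({'id': ['01'], 'wave': ['1'], 'color': ['red'], 'like': [1]})

merged = right.merge(left, on=['id', 'wave', 'color'], how='outer')

tidy = (
    merged
    .fillna({'like': 0})                     # unmatched rows were not liked
    .astype({'like': int, 'dislike': int})   # integer flags and ratings
    [['id', 'color', 'wave', 'like', 'dislike']]
)
print(tidy.dtypes)
```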