Merge child data frame columns to parent data frame with None values in pandas
I have a pandas data frame like this:
Promotion ID | Month | Products |
---|---|---|
PID-1 | June | Refer below for sample1 |
PID-2 | July | Refer below for sample2 |
Sample 1:
|Product Id|
|--|
|PROD1|
|PROD2|
Sample 2:
|Product Id|
|--|
|PROD1|
|PROD2|
|PROD3|
I want to convert this data frame into the following:
Promotion ID | Month | Products |
---|---|---|
PID-1 | June | PROD1 |
 | | PROD2 |
PID-2 | July | PROD1 |
 | | PROD2 |
 | | PROD3 |
The blank cells should be None or NA values. Is there a way to do this in pandas without iterating over the rows?
You can use explode to flatten your data frame like this:
# generating data
import re
import pandas as pd
df = pd.DataFrame([
    ['pid-1', 'June', '| Product Id| |PROD1| |PROD2|'],
    ['pid-2', 'July', '| Product Id| |PROD1| |PROD2| |PROD3|']
], columns=['Promotion ID', 'Month', 'Products'])
# extracting the product list: split on pipes, drop empty pieces and the header
df['Products'] = df['Products']\
    .apply(lambda s: [x for x in re.split(r' *\| *', s) if x != '' and x != 'Product Id'])
# one row per product; the other columns are repeated on every row
exploded_df = df.explode('Products', ignore_index=True)
At this point, df and exploded_df look like this:
# df
Promotion ID Month Products
0 pid-1 June [PROD1, PROD2]
1 pid-2 July [PROD1, PROD2, PROD3]
# exploded_df
Promotion ID Month Products
0 pid-1 June PROD1
1 pid-1 June PROD2
2 pid-2 July PROD1
3 pid-2 July PROD2
4 pid-2 July PROD3
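The snippet above assumes the Products cells are pipe-delimited strings. If the samples are instead separate child data frames, a rough sketch of the same flattening is to tag each child with its promotion ID and merge it back onto the parent. The `children` dict and the `parent`, `sample1`, and `sample2` names are hypothetical, since the question does not show how the child frames are stored.

```python
import pandas as pd

# Hypothetical child frames standing in for "sample1" and "sample2"
sample1 = pd.DataFrame({"Product Id": ["PROD1", "PROD2"]})
sample2 = pd.DataFrame({"Product Id": ["PROD1", "PROD2", "PROD3"]})

# Assumption: the children can be looked up by promotion ID
children = {"PID-1": sample1, "PID-2": sample2}

parent = pd.DataFrame({
    "Promotion ID": ["PID-1", "PID-2"],
    "Month": ["June", "July"],
})

# Stack all children into one long frame, labelling each row with its parent key
long_children = pd.concat(
    [child.assign(**{"Promotion ID": pid}) for pid, child in children.items()],
    ignore_index=True,
)

# One row per (promotion, product); parent columns repeat on every row
flat = (
    parent.merge(long_children, on="Promotion ID")
          .rename(columns={"Product Id": "Products"})
)
print(flat)
```

This produces the same long layout as explode, with no Python-level loop over the parent rows.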
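I would stop here. IMHO, keeping the Month and Promotion ID values only on the first row of each group will just make your life harder. However, since you asked for it, you can use rank and loc to assign None to every row that is not the first of its group: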
# rank needs a numeric column
exploded_df['index'] = exploded_df.index
# using rank to build a mask for rows that are not the first of their group
# (named not_first to avoid shadowing the built-in filter)
not_first = exploded_df\
    .groupby(['Promotion ID'])['index']\
    .rank(method='dense') > 1
# getting rid of the helper column
exploded_df = exploded_df.drop('index', axis=1)
# and voila
exploded_df.loc[not_first, ['Month', 'Promotion ID']] = None
Result:
Promotion ID Month Products
0 pid-1 June PROD1
1 None None PROD2
2 pid-2 July PROD1
3 None None PROD2
4 None None PROD3
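As a possible shortcut, the same mask can be built with `DataFrame.duplicated`, which avoids the temporary index column. This is only a sketch: it assumes each Promotion ID forms a single group, and the `repeated` and `blanked` names are made up for illustration.

```python
import pandas as pd

# Same shape as exploded_df after the explode step
exploded_df = pd.DataFrame({
    "Promotion ID": ["pid-1", "pid-1", "pid-2", "pid-2", "pid-2"],
    "Month": ["June", "June", "July", "July", "July"],
    "Products": ["PROD1", "PROD2", "PROD1", "PROD2", "PROD3"],
})

# True for every row whose Promotion ID has already appeared on an earlier row
repeated = exploded_df.duplicated(subset="Promotion ID", keep="first")

# Blank the repeated key columns on a copy, keeping the original intact
blanked = exploded_df.copy()
blanked.loc[repeated, ["Promotion ID", "Month"]] = None
print(blanked)
```

Working on a copy keeps the fully populated frame available for grouping or joining, which is much harder once the key columns are blanked.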