如何匹配字符串并相应地排列数据帧?
How to match string and arrange dataframe accordingly?
获得输入 df1 和 df2
df1:
Subcategory_Desc Segment_Desc Flow Side Row_no
APPLE APPLE LOOSE Apple Kanzi Front Row 1
APPLE APPLE LOOSE Apple Jazz Front Row 1
CITRUS ORANGES LOOSE Orange Navel Front Row 1
PEAR PEARS LOOSE Lemon Right End Row 1
AVOCADOS AVOCADOS LOOSE Avocado Back Row 1
TROPICAL FRUIT KIWI FRUIT Kiwi Gold Back Row 1
TROPICAL FRUIT KIWI FRUIT Kiwi Green Left End Row 1
df2:
Subcategory_Desc Segment_Desc Flow
TROPICAL FRUIT KIWI FRUIT 5pk Kids Kiwi
APPLE APPLE LOOSE Apple GoldenDel
AVOCADOS AVOCADOS LOOSE Avocado Tray
场景:
考虑到以下条件,应将 Dataframe df2 行插入到 dataframe df1:
- 检查 df1 中 df2 的类似 Subcategory_Desc 和 Segment_Desc 并将 df2 行插入到该特定 Side 的结尾(Front/Back)。如预期输出中所示。
- 还需要考虑 Row_no 列,因为原始数据集包含 n 个 Row_no,这里仅给出第 1 行作为样本数据。
预期输出:
Subcategory_Desc Segment_Desc Flow Side Row_no
APPLE APPLE LOOSE Apple Kanzi Front Row 1
APPLE APPLE LOOSE Apple Jazz Front Row 1
CITRUS ORANGES LOOSE Orange Navel Front Row 1
APPLE APPLE LOOSE Apple GoldenDel Front Row 1
PEAR PEARS LOOSE Lemon Right End Row 1
AVOCADOS AVOCADOS LOOSE Avocado Back Row 1
TROPICAL FRUIT KIWI FRUIT Kiwi Gold Back Row 1
TROPICAL FRUIT KIWI FRUIT 5pk Kids Kiwi Back Row 1
AVOCADOS AVOCADOS LOOSE Avocado Tray Back Row 1
TROPICAL FRUIT KIWI FRUIT Kiwi Green Left End Row 1
不确定为此目的可以使用什么简单的逻辑。
因此,给定以下数据帧:
import pandas as pd
df1 = pd.DataFrame(
{
"Subcategory_Desc": {
0: "APPLE",
1: "APPLE",
2: "CITRUS",
3: "PEAR",
4: "AVOCADOS",
5: "TROPICAL FRUIT",
6: "TROPICAL FRUIT",
},
"Segment_Desc": {
0: "APPLE LOOSE",
1: "APPLE LOOSE",
2: "ORANGES LOOSE",
3: "PEARS LOOSE",
4: "AVOCADOS LOOSE",
5: "KIWI FRUIT",
6: "KIWI FRUIT",
},
"Flow": {
0: "Apple Kanzi",
1: "Apple Jazz",
2: "Orange Navel",
3: "Lemon",
4: "Avocado",
5: "Kiwi Gold",
6: "Kiwi Green",
},
"Side": {
0: "Front",
1: "Front",
2: "Front",
3: "Right_End",
4: "Back",
5: "Back",
6: "Left_End",
},
"Row_no": {
0: "Row 1",
1: "Row 1",
2: "Row 1",
3: "Row 1",
4: "Row 1",
5: "Row 1",
6: "Row 1",
},
}
)
df2 = pd.DataFrame(
{
"Subcategory_Desc": {0: "TROPICAL FRUIT", 1: "APPLE", 2: "AVOCADOS"},
"Segment_Desc": {0: "KIWI FRUIT", 1: "APPLE LOOSE", 2: "AVOCADOS LOOSE"},
"Flow": {0: "5pk Kids Kiwi", 1: "Apple GoldenDel", 2: "Avocado Tray"},
}
)
你可以试试这个:
# Initialize new column
df2["idx"] = ""
# Find indice of first match in df1
for _, row2 in df2.iterrows():
for i, row1 in df1.iterrows():
if i + 1 >= df1.shape[0]:
break
if (
row1["Subcategory_Desc"] == row2["Subcategory_Desc"]
and row1["Segment_Desc"] == row2["Segment_Desc"]
):
row2["idx"] = i
df2 = df2.sort_values(by="idx").reset_index(drop=True)
# Starting from previous indice, find insertion indice in df1
for i, idx in enumerate(df2["idx"]):
side_of_idx = df1.loc[idx, "Side"]
df2.loc[i, "pos"] = df1.index[df1["Side"] == side_of_idx].to_list()[-1] + 1
positions = df2["pos"].astype("int").to_list()
# Clean up df2
df2 = df2.drop(columns=["idx", "pos"])
df2["Side"] = df2["Row_no"] = ""
# Iterate on df1 to insert new rows
for i, pos in enumerate(positions):
# Fill missing values
df2.loc[i, "Side"] = df1.loc[pos - 1, "Side"]
df2.loc[i, "Row_no"] = df1.loc[pos, "Row_no"]
# Insert row
df1 = pd.concat(
[df1.iloc[:pos], pd.DataFrame([df2.iloc[i]]), df1.iloc[pos:]], ignore_index=True
).reset_index(drop=True)
# Increment next position since df1 has changed
if i < len(positions) - 1:
positions[i + 1] += 1
等等:
print(df1)
# Outputs
Subcategory_Desc Segment_Desc Flow Side Row_no
0 APPLE APPLE LOOSE Apple Kanzi Front Row 1
1 APPLE APPLE LOOSE Apple Jazz Front Row 1
2 CITRUS ORANGES LOOSE Orange Navel Front Row 1
3 APPLE APPLE LOOSE Apple GoldenDel Front Row 1
4 PEAR PEARS LOOSE Lemon Right_End Row 1
5 AVOCADOS AVOCADOS LOOSE Avocado Back Row 1
6 TROPICAL FRUIT KIWI FRUIT Kiwi Gold Back Row 1
7 TROPICAL FRUIT KIWI FRUIT 5pk Kids Kiwi Back Row 1
8 AVOCADOS AVOCADOS LOOSE Avocado Tray Back Row 1
9 TROPICAL FRUIT KIWI FRUIT Kiwi Green Left_End Row 1
获得输入 df1 和 df2
df1:
Subcategory_Desc Segment_Desc Flow Side Row_no
APPLE APPLE LOOSE Apple Kanzi Front Row 1
APPLE APPLE LOOSE Apple Jazz Front Row 1
CITRUS ORANGES LOOSE Orange Navel Front Row 1
PEAR PEARS LOOSE Lemon Right End Row 1
AVOCADOS AVOCADOS LOOSE Avocado Back Row 1
TROPICAL FRUIT KIWI FRUIT Kiwi Gold Back Row 1
TROPICAL FRUIT KIWI FRUIT Kiwi Green Left End Row 1
df2:
Subcategory_Desc Segment_Desc Flow
TROPICAL FRUIT KIWI FRUIT 5pk Kids Kiwi
APPLE APPLE LOOSE Apple GoldenDel
AVOCADOS AVOCADOS LOOSE Avocado Tray
场景: 考虑到以下条件,应将 Dataframe df2 行插入到 dataframe df1:
- 检查 df1 中 df2 的类似 Subcategory_Desc 和 Segment_Desc 并将 df2 行插入到该特定 Side 的结尾(Front/Back)。如预期输出中所示。
- 还需要考虑 Row_no 列,因为原始数据集包含 n 个 Row_no,这里仅给出第 1 行作为样本数据。
预期输出:
Subcategory_Desc Segment_Desc Flow Side Row_no
APPLE APPLE LOOSE Apple Kanzi Front Row 1
APPLE APPLE LOOSE Apple Jazz Front Row 1
CITRUS ORANGES LOOSE Orange Navel Front Row 1
APPLE APPLE LOOSE Apple GoldenDel Front Row 1
PEAR PEARS LOOSE Lemon Right End Row 1
AVOCADOS AVOCADOS LOOSE Avocado Back Row 1
TROPICAL FRUIT KIWI FRUIT Kiwi Gold Back Row 1
TROPICAL FRUIT KIWI FRUIT 5pk Kids Kiwi Back Row 1
AVOCADOS AVOCADOS LOOSE Avocado Tray Back Row 1
TROPICAL FRUIT KIWI FRUIT Kiwi Green Left End Row 1
不确定为此目的可以使用什么简单的逻辑。
因此,给定以下数据帧:
import pandas as pd
df1 = pd.DataFrame(
{
"Subcategory_Desc": {
0: "APPLE",
1: "APPLE",
2: "CITRUS",
3: "PEAR",
4: "AVOCADOS",
5: "TROPICAL FRUIT",
6: "TROPICAL FRUIT",
},
"Segment_Desc": {
0: "APPLE LOOSE",
1: "APPLE LOOSE",
2: "ORANGES LOOSE",
3: "PEARS LOOSE",
4: "AVOCADOS LOOSE",
5: "KIWI FRUIT",
6: "KIWI FRUIT",
},
"Flow": {
0: "Apple Kanzi",
1: "Apple Jazz",
2: "Orange Navel",
3: "Lemon",
4: "Avocado",
5: "Kiwi Gold",
6: "Kiwi Green",
},
"Side": {
0: "Front",
1: "Front",
2: "Front",
3: "Right_End",
4: "Back",
5: "Back",
6: "Left_End",
},
"Row_no": {
0: "Row 1",
1: "Row 1",
2: "Row 1",
3: "Row 1",
4: "Row 1",
5: "Row 1",
6: "Row 1",
},
}
)
df2 = pd.DataFrame(
{
"Subcategory_Desc": {0: "TROPICAL FRUIT", 1: "APPLE", 2: "AVOCADOS"},
"Segment_Desc": {0: "KIWI FRUIT", 1: "APPLE LOOSE", 2: "AVOCADOS LOOSE"},
"Flow": {0: "5pk Kids Kiwi", 1: "Apple GoldenDel", 2: "Avocado Tray"},
}
)
你可以试试这个:
# Initialize new column
df2["idx"] = ""
# Find indice of first match in df1
for _, row2 in df2.iterrows():
for i, row1 in df1.iterrows():
if i + 1 >= df1.shape[0]:
break
if (
row1["Subcategory_Desc"] == row2["Subcategory_Desc"]
and row1["Segment_Desc"] == row2["Segment_Desc"]
):
row2["idx"] = i
df2 = df2.sort_values(by="idx").reset_index(drop=True)
# Starting from previous indice, find insertion indice in df1
for i, idx in enumerate(df2["idx"]):
side_of_idx = df1.loc[idx, "Side"]
df2.loc[i, "pos"] = df1.index[df1["Side"] == side_of_idx].to_list()[-1] + 1
positions = df2["pos"].astype("int").to_list()
# Clean up df2
df2 = df2.drop(columns=["idx", "pos"])
df2["Side"] = df2["Row_no"] = ""
# Iterate on df1 to insert new rows
for i, pos in enumerate(positions):
# Fill missing values
df2.loc[i, "Side"] = df1.loc[pos - 1, "Side"]
df2.loc[i, "Row_no"] = df1.loc[pos, "Row_no"]
# Insert row
df1 = pd.concat(
[df1.iloc[:pos], pd.DataFrame([df2.iloc[i]]), df1.iloc[pos:]], ignore_index=True
).reset_index(drop=True)
# Increment next position since df1 has changed
if i < len(positions) - 1:
positions[i + 1] += 1
等等:
print(df1)
# Outputs
Subcategory_Desc Segment_Desc Flow Side Row_no
0 APPLE APPLE LOOSE Apple Kanzi Front Row 1
1 APPLE APPLE LOOSE Apple Jazz Front Row 1
2 CITRUS ORANGES LOOSE Orange Navel Front Row 1
3 APPLE APPLE LOOSE Apple GoldenDel Front Row 1
4 PEAR PEARS LOOSE Lemon Right_End Row 1
5 AVOCADOS AVOCADOS LOOSE Avocado Back Row 1
6 TROPICAL FRUIT KIWI FRUIT Kiwi Gold Back Row 1
7 TROPICAL FRUIT KIWI FRUIT 5pk Kids Kiwi Back Row 1
8 AVOCADOS AVOCADOS LOOSE Avocado Tray Back Row 1
9 TROPICAL FRUIT KIWI FRUIT Kiwi Green Left_End Row 1