Merging several string columns with possible duplicates in a pandas dataframe
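I'm trying to migrate data between two of our systems. One system splits the description across several columns, while the target system has only one. So I need to merge these 5 columns into a single column while removing possible duplicates.
Here is what I have so far. It works, but is there a way to make it faster? Right now it takes quite a while to iterate over the 13,000 records I'm working with. (Once I add more data from our other systems in the future, it could easily reach 30,000 records, so every second counts.)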
columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]
df[columns] = df[columns].replace(np.nan, "")
description_col = []
for i, r in df.iterrows():
    descriptions = []
    for col in columns:
        if r[col] not in descriptions:
            descriptions.append(r[col])
    description = ""
    for d in descriptions:
        description += "\n" + d
    description = description.strip()
    description_col.append(description)
df["Description"] = description_col
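So I guess my question really boils down to: is there a better way to do this?
Edit:
To clarify, I have to make sure the data is preserved in both systems, but the order of the records doesn't matter as long as each record's data stays together.
Also, the order in which the description columns are merged doesn't matter, since most records don't have data in more than 3 of the columns at once. (Most have data in exactly 1, but a fair number have data in 2 or 3 columns.)
Edit 2:
As requested, here is some sample data: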
columns = [
    "Item.Name",
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description",
    "Other Data"
]
df = pd.DataFrame([
    ["Name", "There is some text here.", "", "Some more here.", "", "", "Other Data"],
    ["Name", "", "Some over here.", "Some here as well.", "", "", "Other Data"],
    ["Name", "Some here.", "", "", "Some here.", "And some here.", "Other Data"],
    ["Name", "", "And here.", "", "", "And here.", "Other Data"]
], columns=columns)
You can use pandas.unique(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html
Here is my implementation. I replaced your \n character with a space so that the table prints without large gaps; just swap it back in your own code.
import pandas as pd
import numpy as np

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]
rows = [
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['little', 'lamb', 'little', 'lamb', np.nan],
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['whose', 'fleece', 'was', 'white', 'as'],
    ['snow', np.nan, np.nan, np.nan, np.nan]
]
df = pd.DataFrame(data=rows, columns=columns).fillna('')

def merge_row(row):
    return ' '.join(pd.unique(row)).strip()

df['Description'] = list(map(merge_row, df.loc[:,columns].values))
| | Item.Asset Description | Item.Fixed Asset Sales Description | Item.Item Description | Item.Purchase Description | Item.Sales Description | Description |
|---|---|---|---|---|---|---|
| 0 | mary | had | a | little | lamb | mary had a little lamb |
| 1 | little | lamb | little | lamb | | little lamb |
| 2 | mary | had | a | little | lamb | mary had a little lamb |
| 3 | whose | fleece | was | white | as | whose fleece was white as |
| 4 | snow | | | | | snow |
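One thing to watch for: ' '.join(pd.unique(row)).strip() only trims the ends of the result, so an empty cell sitting between two filled cells would leave a doubled separator in the middle. Below is a minimal sketch of one way around that, run against the sample data from the question and joined with \n as in the original loop. It swaps pd.unique for dict.fromkeys, a plain-Python way to drop duplicates while keeping first-seen order. The names merge_descriptions and description_columns are made up for this example, and I have not timed it on the real 13,000 records.

```python
import pandas as pd

# Hypothetical name for the five description columns from the question.
description_columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description",
]


def merge_descriptions(values, sep="\n"):
    """Join the distinct, non-empty strings in `values` in first-seen order."""
    # Keeping only non-empty str values skips both "" and NaN (a float).
    # dict.fromkeys drops duplicates and preserves insertion order.
    distinct = dict.fromkeys(v for v in values if isinstance(v, str) and v)
    return sep.join(distinct)


# Sample data copied from "Edit 2" of the question.
df = pd.DataFrame(
    [
        ["Name", "There is some text here.", "", "Some more here.", "", "", "Other Data"],
        ["Name", "", "Some over here.", "Some here as well.", "", "", "Other Data"],
        ["Name", "Some here.", "", "", "Some here.", "And some here.", "Other Data"],
        ["Name", "", "And here.", "", "", "And here.", "Other Data"],
    ],
    columns=["Item.Name", *description_columns, "Other Data"],
)

# Iterate over the underlying array rather than df.iterrows(), which
# builds a Series object for every row.
df["Description"] = [
    merge_descriptions(row) for row in df[description_columns].to_numpy()
]
```

Because empty strings and NaN are filtered inside the helper, the fillna('') step is not needed with this version.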