Merging several string columns with possible duplicates in a pandas dataframe

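I am trying to migrate data between two of our systems. One system splits the description across multiple columns, while the target system has only one. So I need to merge these 5 columns into a single column while removing possible duplicates.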

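Here is what I have so far. It works, but is there a way to make it faster? Right now, iterating over the 13,000 records I am working with takes quite a while. (Once I add more data from our other systems in the future, this could easily reach 30,000 records, so every second counts.)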

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]
df[columns] = df[columns].replace(np.nan, "")
description_col = []
for i, r in df.iterrows():
    descriptions = []
    for col in columns:
        if r[col] not in descriptions:
            descriptions.append(r[col])
    description = ""
    for d in descriptions:
        description += "\n" + d
    description = description.strip()
    description_col.append(description)
df["Description"] = description_col

So I guess my question really boils down to: is there a better way to do this?

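Edit: To clarify, I have to make sure the data is maintained in both systems, but the order of the records does not matter, as long as the data for each record stays together.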

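Also, the order in which the description columns are merged does not matter, since most records have data in no more than 3 of the columns at a time. (Most have data in exactly 1 column, but a fair number have data in 2 or 3.)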

Edit 2: As requested, here is some sample data:

columns = [
    "Item.Name",
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description",
    "Other Data"
]
df = pd.DataFrame([
    ["Name", "There is some text here.", "", "Some more here.", "", "", "Other Data"],
    ["Name", "", "Some over here.", "Some here as well.", "", "", "Other Data"],
    ["Name", "Some here.", "", "", "Some here.", "And some here.", "Other Data"],
    ["Name", "", "And here.", "", "", "And here.", "Other Data"]
], columns=columns)

You can use pandas.unique(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html
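
As a quick illustration of why pandas.unique() fits here, the sketch below runs it on a single row of values. The sample values are made up for the example; the point is that the distinct values come back in order of first appearance rather than sorted, and that an empty string counts as a value like any other.

```python
import numpy as np
import pandas as pd

# One row of description values as a NumPy object array (illustrative values only).
row = np.array(
    ["Some here.", "", "", "Some here.", "And some here."],
    dtype=object,
)

# pd.unique keeps the first occurrence of each value and does not sort.
distinct = pd.unique(row)
print(list(distinct))  # ['Some here.', '', 'And some here.']
```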

Here is my implementation. I replaced your \n separator with a space so that the table prints without large gaps; just swap it back in your own code.

import pandas as pd
import numpy as np

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]

rows = [
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['little', 'lamb', 'little', 'lamb', np.nan],
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['whose', 'fleece', 'was', 'white', 'as'],
    ['snow', np.nan, np.nan, np.nan, np.nan]
]

df = pd.DataFrame(data=rows, columns=columns).fillna('')

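def merge_row(row):
    # pd.unique keeps each value once, in order of first appearance
    return ' '.join(pd.unique(row)).strip()

# Map over the rows of the underlying NumPy array instead of using iterrows()
df['Description'] = list(map(merge_row, df.loc[:, columns].values))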

Result:

   Item.Asset Description  Item.Fixed Asset Sales Description  Item.Item Description  Item.Purchase Description  Item.Sales Description  Description
0  mary                    had                                 a                      little                     lamb                    mary had a little lamb
1  little                  lamb                                little                 lamb                                               little lamb
2  mary                    had                                 a                      little                     lamb                    mary had a little lamb
3  whose                   fleece                              was                    white                      as                      whose fleece was white as
4  snow                                                                                                                                  snow
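
One thing to watch for: an empty string is itself a "unique" value, so a blank column sitting between two filled ones would leave a doubled separator in the joined result. The sketch below is one possible variant, run on the sample data from the question, that drops empty strings before deduplicating and joins with "\n" as in the original loop. It assumes blank descriptions should be skipped entirely, which the question does not state explicitly; the names description_columns and merge_descriptions are illustrative only.

```python
import pandas as pd

# The five description columns from the question (variable name is illustrative).
description_columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description",
]

# Sample data as posted in the question.
df = pd.DataFrame([
    ["Name", "There is some text here.", "", "Some more here.", "", "", "Other Data"],
    ["Name", "", "Some over here.", "Some here as well.", "", "", "Other Data"],
    ["Name", "Some here.", "", "", "Some here.", "And some here.", "Other Data"],
    ["Name", "", "And here.", "", "", "And here.", "Other Data"],
], columns=["Item.Name"] + description_columns + ["Other Data"])


def merge_descriptions(row):
    # Assumption: blank cells carry no information, so drop them first.
    non_empty = row[row != ""]
    # Keep each remaining value once, in order of first appearance.
    return "\n".join(pd.unique(non_empty))


# Iterate over the rows of the underlying NumPy array rather than iterrows().
df["Description"] = [
    merge_descriptions(row) for row in df[description_columns].to_numpy()
]

print(df["Description"].tolist())
```

If NaN values are still present in the real data, they would need to be replaced with empty strings first (as both the question and the answer above already do) for the filter to catch them.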