Pandas - Delete old NA rows with new data
I have a dataframe:
df
shelf jar_lid_color jar_material jar_owner size
0 1 red glass David 20
1 1 NaN glass Bob 12
2 2 blue plastic NaN 7
3 2 green NaN Julia 19
4 1 pink plastic Peter 9
Then someone adds a row through the UI:
newdata
shelf jar_lid_color jar_material jar_owner size
0 2 blue plastic Oscar 7
resulting in:
df_new
shelf jar_lid_color jar_material jar_owner size
0 1 red glass David 20
1 1 NaN glass Bob 12
2 2 blue plastic NaN 7
3 2 green NaN Julia 19
4 1 pink plastic Peter 9
5 2 blue plastic Oscar 7
- The program has no way to know whether the added data is genuinely new or must replace an old row that contains missing values
- The user importing data through the UI does not know whether it is already in the database
- NaN values can appear in any column, which means drop_duplicates(subset=...) cannot be used
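To illustrate the last point with a minimal made-up example (three columns only): the incomplete old row and its completed replacement are not literal duplicates, so drop_duplicates keeps both:

```python
import pandas as pd

old = pd.DataFrame({"shelf": [2], "jar_owner": [pd.NA], "size": [7]})
new = pd.DataFrame({"shelf": [2], "jar_owner": ["Oscar"], "size": [7]})

combined = pd.concat([old, new], ignore_index=True)
deduped = combined.drop_duplicates()
# Both rows survive, because NaN != "Oscar"
print(deduped)
```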
In this case, the added row needs to replace the old row that holds incomplete data, resulting in:
df_new
shelf jar_lid_color jar_material jar_owner size
0 1 red glass David 20
1 1 NaN glass Bob 12
2 2 blue plastic Oscar 7
3 2 green NaN Julia 19
4 1 pink plastic Peter 9
I could always iterate over every row of the old database and over the newly added data to check whether they are the same, ignoring the columns where the data has NaN, but that is inefficient for large dataframes:
df = df.append(newdata, axis=0)
for index1, row1 in newdata.iterrows():
    for index2, row2 in df.iterrows():
        cols = row2.index[~row2.isna()].tolist()
        if row1.loc[cols].equals(row2.loc[cols]):
            df.loc[index2] = row1
df = df.drop_duplicates()
Is there a faster way?
So, given the following two dataframes:
import pandas as pd
df = pd.DataFrame(
    {
        "shelf": {0: 1, 1: 1, 2: 2, 3: 2, 4: 1},
        "jar_lid_color": {0: "red", 1: pd.NA, 2: "blue", 3: "green", 4: "pink"},
        "jar_material": {0: "glass", 1: "glass", 2: "plastic", 3: pd.NA, 4: "plastic"},
        "jar_owner": {0: "David", 1: "Bob", 2: pd.NA, 3: "Julia", 4: "Peter"},
        "size": {0: 20, 1: 12, 2: 7, 3: 19, 4: 9},
    }
)

new_data = pd.DataFrame(
    {
        "shelf": {0: 2, 1: 1, 2: 2},
        "jar_lid_color": {0: "blue", 1: "yellow", 2: "green"},
        "jar_material": {0: "plastic", 1: "glass", 2: "iron"},
        "jar_owner": {0: "Oscar", 1: "Bob", 2: "Julia"},
        "size": {0: 7, 1: 12, 2: 19},
    }
)
Your code, with some minor bug fixes, outputs the expected result in 0.012 seconds on average:
import statistics
import time

def two_iterations(df, new_data):
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df = pd.concat([df, new_data]).reset_index(drop=True)
    for _, row1 in new_data.iterrows():
        for index2, row2 in df.iterrows():
            cols = row2.index[~row2.isna()].tolist()
            if row1.loc[cols].equals(row2.loc[cols]):
                df.loc[index2] = row1.values
    df = df.drop_duplicates(keep="first")
    return df

elapsed_time = []
for i in range(500):
    start_time = time.time()
    new_df = two_iterations(df, new_data)
    elapsed_time.append(time.time() - start_time)

print(f"--- {statistics.mean(elapsed_time):.6f} seconds on average ---")
print(new_df)
# Output
--- 0.012458 seconds on average ---
shelf jar_lid_color jar_material jar_owner size
0 1 red glass David 20
1 1 yellow glass Bob 12
2 2 blue plastic Oscar 7
3 2 green iron Julia 19
4 1 pink plastic Peter 9
Whereas if you skip one of the iterations by appending all the values first, and then look for the duplicates in a more idiomatic way, you get the same result about four times faster (0.003 seconds on average):
def faster_way(df, new_data):
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df = pd.concat([df, new_data]).reset_index(drop=True)
    temp_df = df.copy()
    for i, row in temp_df.iterrows():
        if row.isna().any():
            row = row.dropna()
            # A row is superseded only if ALL of its non-NaN values match
            # another row (a bare any() over the comparison frame would
            # iterate column labels and always be truthy)
            match = (
                df.loc[~df.index.isin([i]), row.index.tolist()] == row.tolist()
            ).all(axis=1)
            if match.any():
                df = df.drop(index=i)
    return df

elapsed_time = []
for i in range(500):
    start_time = time.time()
    new_df = faster_way(df, new_data)
    elapsed_time.append(time.time() - start_time)

print(f"--- {statistics.mean(elapsed_time):.6f} seconds on average ---")
print(new_df)
# Output
--- 0.003489 seconds on average ---
shelf jar_lid_color jar_material jar_owner size
0 1 red glass David 20
4 1 pink plastic Peter 9
5 2 blue plastic Oscar 7
6 1 yellow glass Bob 12
7 2 green iron Julia 19