Pandas - Delete old NA rows with new data
I have a dataframe:
df
shelf jar_lid_color jar_material jar_owner size
0 1 red glass David 20
1 1 NaN glass Bob 12
2 2 blue plastic NaN 7
3 2 green NaN Julia 19
4 1 pink plastic Peter 9
Then someone adds a row through the UI:
newdata
shelf jar_lid_color jar_material jar_owner size
0 2 blue plastic Oscar 7
resulting in:
df_new
shelf jar_lid_color jar_material jar_owner size
0 1 red glass David 20
1 1 NaN glass Bob 12
2 2 blue plastic NaN 7
3 2 green NaN Julia 19
4 1 pink plastic Peter 9
5 2 blue plastic Oscar 7
- The program has no way to know whether the added data is genuinely new or must replace an old row that contains missing values
- The user importing data through the UI does not know whether it is already in the database
- NaN values can appear in any column, which means drop_duplicates(subset=...) cannot be used
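To illustrate the last point with a minimal made-up example (three columns only): the incomplete old row and its completed replacement are not literal duplicates, so drop_duplicates keeps both:

```python
import pandas as pd

old = pd.DataFrame({"shelf": [2], "jar_owner": [pd.NA], "size": [7]})
new = pd.DataFrame({"shelf": [2], "jar_owner": ["Oscar"], "size": [7]})

combined = pd.concat([old, new], ignore_index=True)
deduped = combined.drop_duplicates()
# Both rows survive, because NaN != "Oscar"
print(deduped)
```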
In this case, the added row needs to replace the old row that holds incomplete data, resulting in:
df_new
shelf jar_lid_color jar_material jar_owner size
0 1 red glass David 20
1 1 NaN glass Bob 12
2 2 blue plastic Oscar 7
3 2 green NaN Julia 19
4 1 pink plastic Peter 9
I could always iterate over every row of the old database and over the newly added data to check whether they are the same, ignoring the columns where the data has NaN, but that is inefficient for large dataframes:
df = df.append(newdata, axis=0)
for index1, row1 in newdata.iterrows():
    for index2, row2 in df.iterrows():
        cols = row2.index[~row2.isna()].tolist()
        if row1.loc[cols].equals(row2.loc[cols]):
            df.loc[index2] = row1
df = df.drop_duplicates()
Is there a faster way?
So, given the following two dataframes:
import pandas as pd
df = pd.DataFrame(
    {
        "shelf": {0: 1, 1: 1, 2: 2, 3: 2, 4: 1},
        "jar_lid_color": {0: "red", 1: pd.NA, 2: "blue", 3: "green", 4: "pink"},
        "jar_material": {0: "glass", 1: "glass", 2: "plastic", 3: pd.NA, 4: "plastic"},
        "jar_owner": {0: "David", 1: "Bob", 2: pd.NA, 3: "Julia", 4: "Peter"},
        "size": {0: 20, 1: 12, 2: 7, 3: 19, 4: 9},
    }
)

new_data = pd.DataFrame(
    {
        "shelf": {0: 2, 1: 1, 2: 2},
        "jar_lid_color": {0: "blue", 1: "yellow", 2: "green"},
        "jar_material": {0: "plastic", 1: "glass", 2: "iron"},
        "jar_owner": {0: "Oscar", 1: "Bob", 2: "Julia"},
        "size": {0: 7, 1: 12, 2: 19},
    }
)
Your code, with some minor bug fixes, outputs the expected result in 0.012 seconds on average:
import statistics
import time

def two_iterations(df, new_data):
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df = pd.concat([df, new_data]).reset_index(drop=True)
    for _, row1 in new_data.iterrows():
        for index2, row2 in df.iterrows():
            cols = row2.index[~row2.isna()].tolist()
            if row1.loc[cols].equals(row2.loc[cols]):
                df.loc[index2] = row1.values
    df = df.drop_duplicates(keep="first")
    return df

elapsed_time = []
for i in range(500):
    start_time = time.time()
    new_df = two_iterations(df, new_data)
    elapsed_time.append(time.time() - start_time)

print(f"--- {statistics.mean(elapsed_time):.6f} seconds on average ---")
print(new_df)
# Output
--- 0.012458 seconds on average ---
shelf jar_lid_color jar_material jar_owner size
0 1 red glass David 20
1 1 yellow glass Bob 12
2 2 blue plastic Oscar 7
3 2 green iron Julia 19
4 1 pink plastic Peter 9
Whereas if you skip one of the iterations by appending all the values first, and then look for the duplicates in a more idiomatic way, you get the same result about four times faster (0.003 seconds on average):
def faster_way(df, new_data):
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df = pd.concat([df, new_data]).reset_index(drop=True)
    temp_df = df.copy()
    for i, row in temp_df.iterrows():
        if row.isna().any():
            row = row.dropna()
            # A row is superseded only if ALL of its non-NaN values match
            # another row (a bare any() over the comparison frame would
            # iterate column labels and always be truthy)
            match = (
                df.loc[~df.index.isin([i]), row.index.tolist()] == row.tolist()
            ).all(axis=1)
            if match.any():
                df = df.drop(index=i)
    return df

elapsed_time = []
for i in range(500):
    start_time = time.time()
    new_df = faster_way(df, new_data)
    elapsed_time.append(time.time() - start_time)

print(f"--- {statistics.mean(elapsed_time):.6f} seconds on average ---")
print(new_df)
# Output
--- 0.003489 seconds on average ---
shelf jar_lid_color jar_material jar_owner size
0 1 red glass David 20
4 1 pink plastic Peter 9
5 2 blue plastic Oscar 7
6 1 yellow glass Bob 12
7 2 green iron Julia 19