Enhance Python performance pandas dataframe

I have two pandas DataFrames. The first one looks like this:

ID email name
1 "firstname.lastname@provider.com" "firstname lastname"
... ... ...
5150 "firstname.lastname@provider.com" "firstname lastname"

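It has about 5150 rows. The data has not been cleaned: the name column may contain names with typos, extra whitespace, or camel-case spelling. A name may also be an empty string.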

The second DataFrame contains information like this:

Id Email Name To To_Name
1 "firstname.lastname@provider.com" "firstname lastname" "firstname.lastname@provider.com" "firstname lastname"
... ... ...
8500 "firstname.lastname@provider.com" "firstname lastname" "firstname.lastname@provider.com" "firstname lastname"

It has about 8500 rows. Here the name columns have the same problems as in the first DataFrame.

I now want to create a new DataFrame from these two, in the relational-database sense, i.e.

ID From To
1 1 2
2 4 8

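where the ID column refers to the Id column of the second DataFrame, and the values in the From and To columns refer to the ID column of the first DataFrame, which maps each name to an integer.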

The code below runs but takes about a minute. Do you have any ideas on how to speed it up?

Id_new = []
From_new = []
To_new = []

for i in range(0,len(second_df['Id'])):

    Id_new.append(second_df['Id'].iloc[i])
    email = second_df['Email'].iloc[i]
    name = second_df['Name'].iloc[i]
    testdf = first_Df.where(first_Df['Email'] == email).dropna()
    value = int(testdf.loc[testdf['Name'] == name].iloc[0].at["ID"])
    From_new.append(value)

    emailto = second_df['To'].iloc[i]
    nameto = second_df['To_Name'].iloc[i]
    testdf = first_Df.where(first_Df['Email'] == emailto).dropna()
    valueto = int(testdf.loc[testdf['Name'] == nameto].iloc[0].at["ID"])
    To_new.append(valueto)
        
# Build the result once, after the loop has finished
output_df = pd.DataFrame(list(zip(Id_new, From_new, To_new)),
                         columns = ['ID', 'From', 'To'])

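My approach would be to iterate over the first DataFrame and extract the names and IDs into a dictionary. Then iterate over the second DataFrame, again extracting names and IDs, and enrich that dictionary to build the result table.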

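result = {}
# Pass 1: record the ID that the first DataFrame assigns to each name
# (the question's code reads this column as "ID")
for index, row in first_df.iterrows():
    result[row["Name"]] = {"From": row["ID"]}

# Pass 2: add the Id from the second DataFrame for each name
for index, row in second_df.iterrows():
    if row["Name"] in result:
        result[row["Name"]]["To"] = row["Id"]
    else:
        result[row["Name"]] = {"To": row["Id"]}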

This way you only iterate over each DataFrame once.
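The same one-pass-per-frame idea can be sketched against the question's column names. This is only an illustration, not the answerer's exact code: the toy data and the names `id_by_person` and `rows` are made up, and it assumes each (email, name) pair occurs at most once in the first DataFrame. If a pair is duplicated, the last ID wins here, whereas the question's loop takes the first.

```python
import pandas as pd

# Toy data with the question's column names (values are invented).
first_df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Email": ["a@x.com", "b@x.com", "c@x.com"],
    "Name": ["Ann", "Bob", "Cy"],
})
second_df = pd.DataFrame({
    "Id": [10, 11],
    "Email": ["a@x.com", "c@x.com"],
    "Name": ["Ann", "Cy"],
    "To": ["b@x.com", "a@x.com"],
    "To_Name": ["Bob", "Ann"],
})

# One pass over the first frame: (email, name) -> ID.
id_by_person = dict(
    zip(zip(first_df["Email"], first_df["Name"]), first_df["ID"])
)

# One pass over the second frame: two dictionary lookups per row.
# .get() returns None when a sender or recipient is not in first_df.
rows = [
    (msg_id, id_by_person.get((email, name)), id_by_person.get((to, to_name)))
    for msg_id, email, name, to, to_name in zip(
        second_df["Id"], second_df["Email"], second_df["Name"],
        second_df["To"], second_df["To_Name"],
    )
]
output_df = pd.DataFrame(rows, columns=["ID", "From", "To"])
```

Each row now costs two hash lookups instead of two scans of the whole first DataFrame, which is where the original loop spends its time.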

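When working with pandas DataFrames, you should try to avoid running for loops; most of the time there is a better way. In this case you probably want to use merge (see "Merge, join, concatenate and compare" in the pandas user guide).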

You can first merge on email and name, then on to and to_name, like this:

import pandas as pd

df1 = pd.DataFrame(
    {"ID": ["1", "2", "3"], "email": ["a", "b", "c"], "name": ["x", "y", "z"]}
)

df2 = pd.DataFrame(
    {
        "Id": ["1", "2", "3", "4"],
        "email": ["a", "b", "c", "d"],
        "name": ["x", "y", "z", "k"],
        "to": ["m", "a", "b", "p"],
        "to_name": ["r", "x", "y", "u"],
    }
)

# First merge looks up the sender's ID, second merge the recipient's ID.
# how="left" keeps every row of df2; unmatched rows get NaN.
new_df = (
    df2.merge(df1[["ID", "email", "name"]], on=["email", "name"], how="left")
    .rename(columns={"ID": "From"})
    .merge(df1, right_on=["email", "name"], left_on=["to", "to_name"], how="left")
    .rename(columns={"ID": "To"})[["Id", "From", "To"]]
)
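For reference, here is a sketch of the same two-merge idea using the column names from the question (`ID`, `Email`, `Name` in the first DataFrame; `Id`, `Email`, `Name`, `To`, `To_Name` in the second). The data and the names `people`, `from_ids`, and `to_ids` are invented for illustration. It assumes every (Email, Name) pair in the first DataFrame is unique, which `validate="many_to_one"` checks by raising an error otherwise.

```python
import pandas as pd

# Toy data with the question's column names (values are invented).
first_df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Email": ["a@x.com", "b@x.com", "c@x.com"],
    "Name": ["Ann", "Bob", "Cy"],
})
second_df = pd.DataFrame({
    "Id": [10, 11, 12],
    "Email": ["a@x.com", "c@x.com", "b@x.com"],
    "Name": ["Ann", "Cy", "Bob"],
    "To": ["b@x.com", "a@x.com", "z@x.com"],   # last recipient is unknown
    "To_Name": ["Bob", "Ann", "Zed"],
})

people = first_df[["ID", "Email", "Name"]]

# Sender lookup: a left merge keeps one output row per row of second_df.
from_ids = second_df.merge(
    people, on=["Email", "Name"], how="left", validate="many_to_one"
)["ID"]

# Recipient lookup: match (To, To_Name) against (Email, Name).
to_ids = second_df.merge(
    people,
    left_on=["To", "To_Name"],
    right_on=["Email", "Name"],
    how="left",
    validate="many_to_one",
)["ID"]

output_df = pd.DataFrame({
    "ID": second_df["Id"].to_numpy(),
    "From": from_ids.to_numpy(),
    "To": to_ids.to_numpy(),   # NaN where the recipient was not found
})
```

Rows whose sender or recipient is missing from the first DataFrame end up with NaN, so it is worth checking `output_df.isna().sum()` before casting the columns to integers.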

You can simply use replace:

import pandas as pd

df1 = pd.DataFrame([ { "ID": 1, "email": "firstname1.lastname@provider.com", "name": "firstname lastname" }, { "ID": 2, "email": "firstname2.lastname@provider.com", "name": "firstname lastname" } ])
df2 = pd.DataFrame([ { "Id": 1, "Email": "firstname2.lastname@provider.com", "Name": "firstname lastname", "To": "firstname1.lastname@provider.com", "To_Name": "firstname lastname" }, { "Id": 2, "Email": "firstname1.lastname@provider.com", "Name": "firstname lastname", "To": "firstname2.lastname@provider.com", "To_Name": "firstname lastname" } ])

df2[['Email', 'To']] = df2[['Email', 'To']].replace(df1.set_index('email')['ID'])
final_df = df2[['Id', 'Email', 'To']]

Output:

Id Email To
0 1 2 1
1 2 1 2
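Note that the replace call above matches on the email address only and leaves any address it cannot find unchanged in the column. A possible variation, sketched here with invented data and the hypothetical name `email_to_id`, is `Series.map`, which yields NaN for unknown addresses instead. It assumes emails are unique in the first DataFrame, because mapping through a Series with a duplicated index raises an error.

```python
import pandas as pd

# Toy data (invented); the second row of df2 has an unknown recipient.
df1 = pd.DataFrame({"ID": [1, 2], "email": ["a@x.com", "b@x.com"]})
df2 = pd.DataFrame({
    "Id": [1, 2],
    "Email": ["b@x.com", "a@x.com"],
    "To": ["a@x.com", "z@x.com"],
})

# Lookup table: email -> ID (the index must be unique for map to work).
email_to_id = df1.set_index("email")["ID"]

final_df = pd.DataFrame({
    "Id": df2["Id"],
    "From": df2["Email"].map(email_to_id),
    "To": df2["To"].map(email_to_id),   # NaN for addresses not in df1
})
```

Like the replace version, this ignores the name columns, so it only fits if the email address alone identifies a person.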