Improving Python performance with pandas DataFrames
I have two pandas dataframes. The first one looks like this:

ID | email | name
---|---|---
1 | "firstname.lastname@provider.com" | "firstname lastname"
... | ... | ...
5150 | "firstname.lastname@provider.com" | "firstname lastname"

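It has about 5150 rows. The data has not been cleaned: the name column may contain names with typos, extra whitespace, or camel-case spelling. A name may also be an empty string.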
The second dataframe contains information like this:

Id | Email | Name | To | To_Name
---|---|---|---|---
1 | "firstname.lastname@provider.com" | "firstname lastname" | "firstname.lastname@provider.com" | "firstname lastname"
... | ... | ... | ... | ...
8500 | "firstname.lastname@provider.com" | "firstname lastname" | "firstname.lastname@provider.com" | "firstname lastname"

It has about 8500 rows. Here the Name and To_Name columns have the same problems as the name column of the first dataframe.
I now want to create a new dataframe from these two, in the relational-database sense, i.e.

ID | From | To
---|---|---
1 | 1 | 2
2 | 4 | 8

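where the ID column refers to the Id column of the second dataframe, and the values in the From and To columns are the IDs from the first dataframe, which maps each name to an integer.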
The code below runs, but takes about a minute. Do you have any ideas for speeding it up?
import pandas as pd

Id_new = []
From_new = []
To_new = []
for i in range(0, len(second_df['Id'])):
    Id_new.append(second_df['Id'].iloc[i])
    # Sender: filter first_Df by email, then by name, and take the first ID
    email = second_df['Email'].iloc[i]
    name = second_df['Name'].iloc[i]
    testdf = first_Df.where(first_Df['Email'] == email).dropna()
    value = int(testdf.loc[testdf['Name'] == name].iloc[0].at["ID"])
    From_new.append(value)
    # Recipient: same lookup with the To / To_Name columns
    emailto = second_df['To'].iloc[i]
    nameto = second_df['To_Name'].iloc[i]
    testdf = first_Df.where(first_Df['Email'] == emailto).dropna()
    valueto = int(testdf.loc[testdf['Name'] == nameto].iloc[0].at["ID"])
    To_new.append(valueto)
output_df = pd.DataFrame(list(zip(Id_new, From_new, To_new)),
                         columns=['ID', 'From', 'To'])
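My approach would be to iterate over the first dataframe and extract the names and IDs into a dictionary. Then iterate over the second dataframe, extract its names and IDs as well, and enrich that dictionary to build the result table.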
result = {}
# The first dataframe's ID column is spelled "ID" in the question's code
for index, row in first_df.iterrows():
    result[row["Name"]] = {"From": row["ID"]}
for index, row in second_df.iterrows():
    if row["Name"] in result:
        result[row["Name"]]["To"] = row["Id"]
    else:
        result[row["Name"]] = {"To": row["Id"]}
This way, you only iterate over each dataframe once.
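The lookup the question asks for keys on the pair (email, name) rather than on the name alone. A minimal sketch of the same dictionary idea with that key is shown below. The toy data is made up for illustration, and the column names are assumed to match the ones used in the question's code.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (values are invented for illustration).
first_df = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Email": ["a@x.com", "b@x.com", "c@x.com"],
        "Name": ["Ann", "Bob", "Cy"],
    }
)
second_df = pd.DataFrame(
    {
        "Id": [1, 2],
        "Email": ["a@x.com", "c@x.com"],
        "Name": ["Ann", "Cy"],
        "To": ["b@x.com", "a@x.com"],
        "To_Name": ["Bob", "Ann"],
    }
)

# Build the lookup once: (email, name) -> ID.
lookup = dict(zip(zip(first_df["Email"], first_df["Name"]), first_df["ID"]))

# One pass over the second dataframe; .get returns None for unknown pairs.
rows = [
    (row_id, lookup.get((email, name)), lookup.get((to, to_name)))
    for row_id, email, name, to, to_name in zip(
        second_df["Id"],
        second_df["Email"],
        second_df["Name"],
        second_df["To"],
        second_df["To_Name"],
    )
]
output_df = pd.DataFrame(rows, columns=["ID", "From", "To"])
```

Each row then costs two dictionary lookups instead of two scans of the first dataframe, which is where most of the time in the original loop is likely to go.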
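When working with pandas DataFrames you should try to avoid for loops; most of the time there is a better way. In this case you probably want to use merge (see "Merge, join, concatenate and compare" in the pandas user guide).
You can merge first on email and name, and then on to and to_name, like this: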
import pandas as pd

df1 = pd.DataFrame(
    {"ID": ["1", "2", "3"], "email": ["a", "b", "c"], "name": ["x", "y", "z"]}
)
df2 = pd.DataFrame(
    {
        "Id": ["1", "2", "3", "4"],
        "email": ["a", "b", "c", "d"],
        "name": ["x", "y", "z", "k"],
        "to": ["m", "a", "b", "p"],
        "to_name": ["r", "x", "y", "u"],
    }
)
# First merge attaches the sender's ID, second merge the recipient's ID;
# how="left" keeps rows of df2 that have no match (their ID becomes NaN)
new_df = (
    df2.merge(df1[["ID", "email", "name"]], on=["email", "name"], how="left")
    .rename(columns={"ID": "From"})
    .merge(df1, right_on=["email", "name"], left_on=["to", "to_name"], how="left")
    .rename(columns={"ID": "To"})[["Id", "From", "To"]]
)
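Because the question says the name columns may contain extra whitespace or inconsistent capitalization, an exact merge can miss rows that a person would consider equal. The sketch below normalizes the names into helper columns before merging. The `normalize` function and the `name_key` / `to_name_key` columns are hypothetical names introduced here, and the data is invented. It does not attempt to repair typos or names written without spaces.

```python
import pandas as pd


def normalize(s: pd.Series) -> pd.Series:
    """Trim, collapse runs of whitespace, and lowercase (hypothetical helper)."""
    return (
        s.fillna("")
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.lower()
    )


df1 = pd.DataFrame(
    {"ID": [1, 2], "email": ["a@x.com", "b@x.com"], "name": ["Ann  Lee", "bob ray "]}
)
df2 = pd.DataFrame(
    {
        "Id": [1],
        "email": ["a@x.com"],
        "name": [" ann lee"],
        "to": ["b@x.com"],
        "to_name": ["Bob Ray"],
    }
)

# Helper key columns; the original name columns are left untouched.
df1["name_key"] = normalize(df1["name"])
df2["name_key"] = normalize(df2["name"])
df2["to_name_key"] = normalize(df2["to_name"])

keys = df1[["ID", "email", "name_key"]]
new_df = (
    # Sender lookup: match on email plus normalized name.
    df2.merge(keys.rename(columns={"ID": "From"}), on=["email", "name_key"], how="left")
    # Recipient lookup: rename the right-hand columns so the keys line up.
    .merge(
        keys.rename(columns={"ID": "To", "email": "to", "name_key": "to_name_key"}),
        on=["to", "to_name_key"],
        how="left",
    )[["Id", "From", "To"]]
)
```

Renaming the right-hand columns before each merge avoids the `_x` / `_y` suffixes that pandas adds when both sides share a non-key column name.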
You can simply use replace:
import pandas as pd
df1 = pd.DataFrame([ { "ID": 1, "email": "firstname1.lastname@provider.com", "name": "firstname lastname" }, { "ID": 2, "email": "firstname2.lastname@provider.com", "name": "firstname lastname" } ])
df2 = pd.DataFrame([ { "Id": 1, "Email": "firstname2.lastname@provider.com", "Name": "firstname lastname", "To": "firstname1.lastname@provider.com", "To_Name": "firstname lastname" }, { "Id": 2, "Email": "firstname1.lastname@provider.com", "Name": "firstname lastname", "To": "firstname2.lastname@provider.com", "To_Name": "firstname lastname" } ])
df2[['Email', 'To']] = df2[['Email', 'To']].replace(df1.set_index('email')['ID'])
final_df = df2[['Id', 'Email', 'To']]
Output:

  | Id | Email | To
---|---|---|---
0 | 1 | 2 | 1
1 | 2 | 1 | 2

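Note that this answer matches on the email address only, not on the name. A variant of the same idea using `Series.map` is sketched below, under the assumption that an email address is enough to identify a person. Unlike `replace`, `map` turns addresses that are missing from the first dataframe into NaN instead of leaving the string in place, which makes unmatched rows easy to spot. The data and the `email_to_id` name are invented for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2], "email": ["a@x.com", "b@x.com"]})
df2 = pd.DataFrame(
    {
        "Id": [1, 2],
        "Email": ["b@x.com", "a@x.com"],
        "To": ["a@x.com", "zzz@x.com"],  # second address is not in df1
    }
)

# email -> ID lookup; map() needs a unique index, so drop repeated emails first.
email_to_id = df1.drop_duplicates("email").set_index("email")["ID"]

final_df = pd.DataFrame(
    {
        "Id": df2["Id"],
        "From": df2["Email"].map(email_to_id),
        "To": df2["To"].map(email_to_id),  # unmatched address becomes NaN
    }
)
```

Rows with a NaN in From or To can then be inspected with `final_df[final_df.isna().any(axis=1)]`.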