Why is this script taking so long to run?
I have a CSV file with 200,000 rows. I loaded it into a DataFrame and want to anonymize it with Faker using the following script:
for i in range(MasterDE1.FirstName.size):
    MasterDE1.loc[(MasterDE1["Gender__pc"] == 'Female'), ['FirstName','LastName']] = fake.first_name_female(),fake.last_name_female()
    MasterDE1.loc[(MasterDE1["Gender__pc"] == 'Male'), ['FirstName','LastName']] = fake.first_name_male(),fake.last_name_male()
    MasterDE1.loc[(MasterDE1["Gender__pc"] == 'Unknown'), ['FirstName','LastName']] = fake.first_name(),fake.last_name()
    MasterDE1['Name'] = MasterDE1['FirstName'] + ' ' + MasterDE1['LastName']
    MasterDE1['EmailAddress'] = 'smithandthunder' + str(i+1) + '@gmail.com'
It has been running for the past 20 minutes (I don't think the kernel has died).
You can omit the loop:
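import numpy as np
import pandas as pd

# Small sample frame standing in for the real 200,000-row data
MasterDE1 = pd.DataFrame({'Gender__pc': ['Female', 'Male', 'Unknown'],
                          'FirstName': ['s', 'd', 'f'],
                          'LastName': ['d', 'f', 'r']})
MasterDE1 = pd.concat([MasterDE1]*3).reset_index(drop=True)
print(MasterDE1)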
  FirstName Gender__pc LastName
0         s     Female        d
1         d       Male        f
2         f    Unknown        r
3         s     Female        d
4         d       Male        f
5         f    Unknown        r
6         s     Female        d
7         d       Male        f
8         f    Unknown        r
# Stand-ins for Faker's name generators
def f1():
    return 'first_name_female' + str(np.random.randint(100))

def f2():
    return 'last_name_female' + str(np.random.randint(100))

maskfem = (MasterDE1["Gender__pc"] == 'Female')
# Row numbers 1..n as strings, used to build a distinct e-mail address per row
a = pd.Series(((np.arange(len(MasterDE1.index))) + 1).astype(str))
# Generate one name per matching row, then assign them all in a single step
MasterDE1.loc[maskfem, 'FirstName'] = [f1() for x in np.arange(maskfem.sum())]
MasterDE1.loc[maskfem, 'LastName'] = [f2() for x in np.arange(maskfem.sum())]
MasterDE1['Name'] = MasterDE1['FirstName'] + ' ' + MasterDE1['LastName']
MasterDE1['EmailAddress'] = 'smithandthunder' + a + '@gmail.com'
print(MasterDE1)
            FirstName Gender__pc           LastName  \
0 first_name_female70     Female last_name_female64
1                   d       Male                  f
2                   f    Unknown                  r
3  first_name_female6     Female last_name_female67
4                   d       Male                  f
5                   f    Unknown                  r
6 first_name_female59     Female last_name_female99
7                   d       Male                  f
8                   f    Unknown                  r

                                    Name               EmailAddress
0 first_name_female70 last_name_female64 smithandthunder1@gmail.com
1                                    d f smithandthunder2@gmail.com
2                                    f r smithandthunder3@gmail.com
3  first_name_female6 last_name_female67 smithandthunder4@gmail.com
4                                    d f smithandthunder5@gmail.com
5                                    f r smithandthunder6@gmail.com
6 first_name_female59 last_name_female99 smithandthunder7@gmail.com
7                                    d f smithandthunder8@gmail.com
8                                    f r smithandthunder9@gmail.com
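The example above only fills in the 'Female' rows. A minimal sketch of how the same mask-based idea might be extended to all three genders follows. The `make_generator` helper and the `generators` dictionary are hypothetical names introduced here for illustration, and the generated strings are stand-ins so the sketch runs without Faker; with Faker installed, the pairs could presumably be replaced by `fake.first_name_female`, `fake.last_name_female` and so on.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for a Faker provider: returns a function
# that produces a new pseudo-random name on every call.
def make_generator(prefix):
    return lambda: f"{prefix}{rng.integers(100)}"

# Assumed mapping: gender value -> (first-name function, last-name function)
generators = {
    "Female": (make_generator("first_f"), make_generator("last_f")),
    "Male": (make_generator("first_m"), make_generator("last_m")),
    "Unknown": (make_generator("first_u"), make_generator("last_u")),
}

MasterDE1 = pd.DataFrame({
    "Gender__pc": ["Female", "Male", "Unknown"] * 3,
    "FirstName": ["s", "d", "f"] * 3,
    "LastName": ["d", "f", "r"] * 3,
})

# One pass per gender (three passes in total) instead of one pass per row.
for gender, (first_fn, last_fn) in generators.items():
    mask = MasterDE1["Gender__pc"] == gender
    n = int(mask.sum())
    MasterDE1.loc[mask, "FirstName"] = [first_fn() for _ in range(n)]
    MasterDE1.loc[mask, "LastName"] = [last_fn() for _ in range(n)]

MasterDE1["Name"] = MasterDE1["FirstName"] + " " + MasterDE1["LastName"]
print(MasterDE1)
```

Each row gets its own generated name because the list comprehension calls the generator once per matching row, whereas assigning a single scalar pair to the mask would give every matching row the same name.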
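I don't know why it is taking so long, but it may be because of the size of the file.
However, you can find a way to monitor the loop to see whether it is still working: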
signal = 0
for i in range(0, 200000):
    ...
    # something going on in the loop
    ...
    # signal the loop
    signal += 1
    if signal == 50000 or signal == 100000 or signal == 150000:
        print('It\'s still going!')
    elif signal > 200000:
        print('It\'s over 200000 already!')
        break  # or you can raise an error instead of break (raise RuntimeError)
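The same monitoring idea can be written with a modulo check, so the checkpoints do not have to be listed one by one. This is only a sketch: `run_with_progress` and its `report_every` parameter are hypothetical names, and the per-row work is left out.

```python
def run_with_progress(n_rows, report_every=50_000):
    """Loop n_rows times and return the progress messages that were printed."""
    messages = []
    for i in range(n_rows):
        # The per-row work would go here.
        done = i + 1
        if done % report_every == 0:
            msg = f"{done}/{n_rows} rows processed"
            print(msg)
            messages.append(msg)
    return messages

progress = run_with_progress(200_000)
```

A report that never appears, or appears very slowly, suggests that each iteration is expensive rather than that the kernel has died.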
Instead of updating the DataFrame on every iteration, you can generate the names first and then assign them:
df = pd.DataFrame({'Gender': np.random.choice(['Female', 'Male', 'Unknown'], p=[0.45, 0.45, 0.1], size=2*10**5),
'First Name': np.nan, 'Last Name': np.nan})
df.head()
Out:
   First Name  Gender  Last Name
0         NaN  Female        NaN
1         NaN    Male        NaN
2         NaN  Female        NaN
3         NaN    Male        NaN
4         NaN    Male        NaN
df.shape
Out: (200000, 3)
The following should finish within a few minutes:
# Build each mask once and reuse it for both the row selection and the row count
female = df['Gender'] == 'Female'
male = df['Gender'] == 'Male'
unknown = df['Gender'] == 'Unknown'
df.loc[female, ['First Name', 'Last Name']] = [(fake.first_name_female(), fake.last_name_female()) for _ in range(female.sum())]
df.loc[male, ['First Name', 'Last Name']] = [(fake.first_name_male(), fake.last_name_male()) for _ in range(male.sum())]
df.loc[unknown, ['First Name', 'Last Name']] = [(fake.first_name(), fake.last_name()) for _ in range(unknown.sum())]
df.head()
Out:
  First Name  Gender Last Name
0       Ruth  Female     Moore
1  Christina  Female     Jones
2    Lindsey  Female     Davis
3      Aaron Unknown   Watkins
4     Joshua    Male     Henry
After that, something like df['Name'] = df['First Name'] + ' ' + df['Last Name']
should be fast.
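As a rough illustration of that last step, the sketch below builds both the full name and a numbered e-mail address as whole-column operations. The three sample rows are taken from the output above; the `numbers` variable is a name introduced here, and the 'smithandthunder' prefix comes from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Ruth", "Aaron", "Joshua"],
    "Last Name": ["Moore", "Watkins", "Henry"],
})

# String concatenation on whole columns, no Python-level loop over rows.
df["Name"] = df["First Name"] + " " + df["Last Name"]

# Row numbers 1..n as strings; sharing df's index keeps the rows aligned
# even if the frame does not have a default 0..n-1 index.
numbers = pd.Series(np.arange(1, len(df) + 1), index=df.index).astype(str)
df["EmailAddress"] = "smithandthunder" + numbers + "@gmail.com"
print(df)
```

Passing `index=df.index` is a defensive choice: pandas aligns Series on their index labels when adding them, so a counter with a different index could otherwise produce missing values.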