Find the most common email patterns based on email addresses and names in Python
Given a dataframe as follows:
   firstname  lastname    email                             companyname
0  Kieron     Futter      kieron.futter@ascendishealth.com  ascendishealth.com
1  Vinsonn    Law         vinsonn.law@carestream.com        Carestream
2  Rayan      Vanderhoof  rayan.vanderhoof@olympus.com      Olympus America Inc.
3  Andy       Joiner      andy@tepha.com                    Tepha Inc.
4  Christine  Nichols     cnichols@prosetta.com             Prosetta Corp.
5  Bo         Smith       bsmith@innoviveinc.com            Innovive, Inc.
6  Rebecca    Ford        rford@catholiccharitiesswks.org   catholiccharitiesswks.org
7  Fatima     Sheikh      fatima@broomestreetsociety.com    broomestreetsociety
8  Zack       Scriven     zack.scriven@soffaelectric.com    soffaelectric
9  Bara       Alomari     baraa@playhut.com                 playhut.com
How can I find the top 3 most common email patterns (first@example.com, firstlast@example.com, first.last@example.com, last@example.com, f.last@example.com, lastF@example.com, first_last@example.com, firstL@example.com) by comparing the values in the email column with the firstname and lastname columns?
I have already extracted the name part of each email address using:
df['name_email'] = df.email.str.split('@', expand=True)[0]
Output:
0 douglas.watson
1 nick.holekamp
2 rob.schriener
3 austin.phillips
4 egeiger
...
995 thanley
996 cmarks
997 darryl.rickner
998 lalit
999 parul.dutt
Thanks.
Edit:
Error raised by @Stef's code:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/usr/local/lib/python3.7/site-packages/pandas/core/series.py in _try_kind_sort(arr)
2947 # if kind==mergesort, it can fail for object dtype
-> 2948 return arr.argsort(kind=kind)
2949 except TypeError:
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-25-a939f85d610f> in <module>
7 df['f.last'] = df.firstname.str.lower()[0] + '.' + df.lastname.str.lower() == df.name_email
8
----> 9 print(df.iloc[:,4:].sum().sort_values(ascending=False))
/usr/local/lib/python3.7/site-packages/pandas/core/series.py in sort_values(self, axis, ascending, inplace, kind, na_position, ignore_index)
2960 idx = ibase.default_index(len(self))
2961
-> 2962 argsorted = _try_kind_sort(arr[good])
2963
2964 if is_list_like(ascending):
/usr/local/lib/python3.7/site-packages/pandas/core/series.py in _try_kind_sort(arr)
2950 # stable sort not available for object dtype
2951 # uses the argsort default quicksort
-> 2952 return arr.argsort(kind="quicksort")
2953
2954 arr = self._values
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
Output of df.iloc[:,4:].info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9937 entries, 0 to 9999
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 companyname 9937 non-null object
1 industry 9937 non-null object
2 level 9624 non-null object
3 primarydomain 9937 non-null object
4 twitterid 225 non-null object
5 facebookid 11 non-null object
6 linkedinid 2564 non-null object
7 industry.1 9937 non-null object
8 companysize 7538 non-null object
9 companyrevenue 7596 non-null object
10 city 8773 non-null object
11 state 7936 non-null object
12 dept 8865 non-null object
13 phonenumber 41 non-null object
14 net_name 9937 non-null object
15 domain_name 9937 non-null object
16 email 9937 non-null object
17 name_email1 9937 non-null object
18 name_email 9937 non-null object
19 first 9937 non-null bool
20 firstlast 9937 non-null bool
21 first.last 9937 non-null bool
22 last 9937 non-null bool
23 f.last 9937 non-null bool
dtypes: bool(5), object(19)
memory usage: 1.9+ MB
You can add a boolean column for each candidate pattern and then count the hits:
import pandas as pd
df = pd.DataFrame({ 'firstname': ['Kieron', 'Vinsonn', 'Rayan', 'Andy', 'Christine', 'Bo', 'Rebecca', 'Fatima', 'Zack', 'Bara'],
'lastname': ['Futter', 'Law', 'Vanderhoof', 'Joiner', 'Nichols', 'Smith', 'Ford', 'Sheikh', 'Scriven', 'Alomari'],
'email': ['kieron.futter@ascendishealth.com', 'vinsonn.law@carestream.com', 'rayan.vanderhoof@olympus.com', 'andy@tepha.com', 'cnichols@prosetta.com', 'bsmith@innoviveinc.com', 'rford@catholiccharitiesswks.org', 'fatima@broomestreetsociety.com', 'zack.scriven@soffaelectric.com', 'baraa@playhut.com']})
df['name_email'] = df.email.str.lower().str.split('@', expand = True)[0]
df['first'] = df.firstname.str.lower() == df.name_email
df['firstlast'] = df.firstname.str.lower() + df.lastname.str.lower() == df.name_email
df['first.last'] = df.firstname.str.lower() + '.' + df.lastname.str.lower() == df.name_email
df['last'] = df.lastname.str.lower() == df.name_email
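# .str[0] takes each row's first initial; a bare [0] would select the whole first name in row 0
df['f.last'] = df.firstname.str.lower().str[0] + '.' + df.lastname.str.lower() == df.name_email
# ... etc. ...
print(df.iloc[:,4:].sum().sort_values(ascending=False))
Result (the order of the tied zero counts may vary with the pandas version):
first.last 4
first 2
f.last 0
last 0
firstlast 0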
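When building a pattern that uses an initial, such as f.last, it matters how the first character is taken. The short sketch below, using a small made-up Series, contrasts label-based indexing on a Series with element-wise slicing through the .str accessor:

```python
import pandas as pd

firstname = pd.Series(["Kieron", "Vinsonn", "Rayan"])

# Label-based indexing: returns the single string stored at index label 0.
row_zero = firstname.str.lower()[0]

# Element-wise string slicing: returns a Series holding each row's first character.
initials = firstname.str.lower().str[0]

print(row_zero)           # kieron
print(initials.tolist())  # ['k', 'v', 'r']
```

Only the second form yields one initial per row, which is what an f.last or flast pattern needs.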
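To make this independent of the position of the newly added columns, so that only the boolean pattern columns are summed, you can also use: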
df.select_dtypes(include='bool').sum().sort_values(ascending=False)
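The same idea can be extended to the remaining patterns from the question without adding columns to df. The following is only a sketch under a few assumptions: the pattern labels in the candidates dict are names chosen here for illustration, and flast (initial plus last name, no dot) is an extra pattern not listed in the question, added because addresses such as cnichols and bsmith appear in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "firstname": ["Kieron", "Vinsonn", "Rayan", "Andy", "Christine",
                  "Bo", "Rebecca", "Fatima", "Zack", "Bara"],
    "lastname": ["Futter", "Law", "Vanderhoof", "Joiner", "Nichols",
                 "Smith", "Ford", "Sheikh", "Scriven", "Alomari"],
    "email": ["kieron.futter@ascendishealth.com", "vinsonn.law@carestream.com",
              "rayan.vanderhoof@olympus.com", "andy@tepha.com",
              "cnichols@prosetta.com", "bsmith@innoviveinc.com",
              "rford@catholiccharitiesswks.org", "fatima@broomestreetsociety.com",
              "zack.scriven@soffaelectric.com", "baraa@playhut.com"],
})

first = df["firstname"].str.lower()
last = df["lastname"].str.lower()
local = df["email"].str.lower().str.split("@").str[0]  # part before the '@'

# Candidate local parts per pattern (labels are illustrative, not from the question).
candidates = {
    "first": first,
    "last": last,
    "firstlast": first + last,
    "first.last": first + "." + last,
    "first_last": first + "_" + last,
    "flast": first.str[0] + last,        # assumed extra pattern
    "f.last": first.str[0] + "." + last,
    "firstl": first + last.str[0],
    "lastf": last + first.str[0],
}

# One boolean column per pattern; summing counts the matching rows.
matches = pd.DataFrame({name: cand == local for name, cand in candidates.items()})
top3 = matches.sum().nlargest(3)
print(top3)
```

On the ten sample rows this reports first.last (4), flast (3) and first (2) as the three most common patterns. Because matches contains only boolean columns, no position-based slicing such as iloc[:,4:] is needed.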