Extracting and then combining Street Address and Apartment Numbers in pandas
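I'm having trouble coming up with the code needed to do the following. There are similar questions, but I can't work out how to adapt their code to my needs.
I have a pandas DataFrame with more than 100k rows. Here is how the addresses and apartment numbers are currently formatted:
Current DF: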
temp = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'], 'col2':['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan']}
data = pd.DataFrame(temp)
data
    col1                       col2
0   220 CENTRAL STREET, 50     50
1   165 EAST 66TH ST, RESI     RESI
2   106 SPRUCE STREET, 1       nan
3   14 EAST 67TH STREET        nan
4   1131 OGEN AVENUE           nan
5   200 EAST 1ST STREET, RU    2A
6   520 PARK LANE              DPH60
7   520 PARK LANE              DPH56
8   80 BAY STREET LANDING, 1A  1A
9   520 PARK SOUTH, DPH54      DPH54
10  520 PARK LANE              DPH52
11  62 VEST STREET             21F
12  256 FLARIN AVENUE          nan
Desired DF (data1), which adds 3 new columns so that different levels of granularity can be used later:
temp1 = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
'col2':['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'],
'building_address':['220 CENTRAL STREET', '165 EAST 66TH ST', '106 SPRUCE STREET', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING', '520 PARK SOUTH', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
'apt_no': ['50', 'RESI', '1', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'],
'full_address':['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, 2A', '520 PARK LANE, DPH60', '520 PARK LANE, DPH56', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE, DPH52', '62 VEST STREET, 21F', '256 FLARIN AVENUE']}
data1 = pd.DataFrame(temp1)
data1
    col1                       col2   building_address       apt_no  \
0   220 CENTRAL STREET, 50     50     220 CENTRAL STREET     50
1   165 EAST 66TH ST, RESI     RESI   165 EAST 66TH ST       RESI
2   106 SPRUCE STREET, 1       nan    106 SPRUCE STREET      1
3   14 EAST 67TH STREET        nan    14 EAST 67TH STREET    nan
4   1131 OGEN AVENUE           nan    1131 OGEN AVENUE       nan
5   200 EAST 1ST STREET, RU    2A     200 EAST 1ST STREET    2A
6   520 PARK LANE              DPH60  520 PARK LANE          DPH60
7   520 PARK LANE              DPH56  520 PARK LANE          DPH56
8   80 BAY STREET LANDING, 1A  1A     80 BAY STREET LANDING  1A
9   520 PARK SOUTH, DPH54      DPH54  520 PARK SOUTH         DPH54
10  520 PARK LANE              DPH52  520 PARK LANE          DPH52
11  62 VEST STREET             21F    62 VEST STREET         21F
12  256 FLARIN AVENUE          nan    256 FLARIN AVENUE      nan

    full_address
0   220 CENTRAL STREET, 50
1   165 EAST 66TH ST, RESI
2   106 SPRUCE STREET, 1
3   14 EAST 67TH STREET
4   1131 OGEN AVENUE
5   200 EAST 1ST STREET, 2A
6   520 PARK LANE, DPH60
7   520 PARK LANE, DPH56
8   80 BAY STREET LANDING, 1A
9   520 PARK SOUTH, DPH54
10  520 PARK LANE, DPH52
11  62 VEST STREET, 21F
12  256 FLARIN AVENUE
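In the existing DF (data), col1 is a street address that may or may not include an apartment number. For simplicity, I assume that if there is a comma, the value in col1 includes an apartment number, and that the part after the comma is that apartment number.
col2 contains only the apartment number, and it has nan values. In some cases, such as row 5, the apartment number in col2 ('2A') does not match the part after the comma in col1 ('RU'). In other cases, such as row 2, col2 is nan but col1 has an apartment number after the comma.
What I want to do is add 3 new columns (shown in the desired DF, data1):
['building_address'] contains everything before the comma, so it reads '220 CENTRAL STREET' where col1 reads '220 CENTRAL STREET, 50'.
['apt_no'] checks whether col2 is nan. If it is, it looks for a value after the comma in col1 and, if one is found, fills apt_no with it. For example, in row 2 of data1, apt_no takes the value '1', which comes from the part after the comma in col1. It also checks whether col1 has a part after the comma while col2 has a value; if the two differ, it takes the value from col2. For example, in row 5, apt_no is '2A', taken from col2, even though col1 shows 'RU' after the comma. Finally, if col1 has no comma and col2 is nan, apt_no stays nan.
['full_address'] joins ['building_address'] and ['apt_no'] into one string in the format building_address, apt_no (as shown above). If apt_no is nan, full_address is the same as col1.
I've been struggling with this for a few hours and haven't figured it out. Thanks for looking.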
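Here is code that gives the result you want. Missing apartment numbers end up as real NaN values in apt_no rather than the string 'nan'.
# Split col1 at the first comma: part 0 is the building, part 1 (if present) the apartment.
parts = data['col1'].str.split(',', n=1)
data['building_address'] = parts.str[0].str.strip()
# col2 marks missing values with the string 'nan'; turn those into real NaN.
apt_from_col2 = data['col2'].mask(data['col2'] == 'nan')
# Prefer col2 and fall back to the text after the comma in col1.
data['apt_no'] = apt_from_col2.fillna(parts.str[1].str.strip())
# String + NaN is NaN, so rows without an apartment number fall back to the building address.
data['full_address'] = (data['building_address'] + ', ' + data['apt_no']).fillna(data['building_address'])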
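As a cross-check of the rules in the question (col2 wins when it has a value, and the text after the comma in col1 is only a fallback), the logic can be wrapped in a function and run on a few of the tricky rows. This is a minimal sketch rather than a definitive implementation: the name add_address_columns is made up for illustration, and it assumes that missing values in col2 are stored as the literal string 'nan', as in the sample data.

```python
import pandas as pd


def add_address_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with building_address, apt_no and full_address added.

    Hypothetical helper: assumes col1 holds 'street[, apt]' and col2 holds the
    apartment number, with the literal string 'nan' marking a missing value.
    """
    out = df.copy()
    # Split at the first comma only; rows without a comma have no second part.
    parts = out['col1'].str.split(',', n=1)
    out['building_address'] = parts.str[0].str.strip()
    apt_from_col1 = parts.str[1].str.strip()
    # Treat the string 'nan' in col2 as missing, then prefer col2 over col1.
    apt_from_col2 = out['col2'].mask(out['col2'] == 'nan')
    out['apt_no'] = apt_from_col2.fillna(apt_from_col1)
    # String + NaN is NaN, so rows without an apartment fall back to the building.
    out['full_address'] = (out['building_address'] + ', ' + out['apt_no']).fillna(
        out['building_address'])
    return out


# Four rows from the question: col1-only apartment, conflicting values,
# col2-only apartment, and no apartment at all.
sample = pd.DataFrame({
    'col1': ['106 SPRUCE STREET, 1', '200 EAST 1ST STREET, RU',
             '520 PARK LANE', '256 FLARIN AVENUE'],
    'col2': ['nan', '2A', 'DPH60', 'nan'],
})
result = add_address_columns(sample)
print(result[['building_address', 'apt_no', 'full_address']])
```

Working on a copy and assigning whole columns avoids the chained-assignment pattern data['apt_no'][mask] = ..., which can trigger SettingWithCopyWarning and may not modify the original frame.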