通过正则表达式 str.extract() 从数据框中的完整地址列获取邮政编码并添加为 pandas 中的新列
Get postal code from full address column in dataframe by regex str.extract() and add as new column in pandas
我有一个列中包含完整地址的数据框,我需要创建一个单独的列,其中仅包含同一数据框中以 7 开头的 5 位邮政编码。部分地址可能为空或找不到邮政编码。
如何拆分列以仅获取邮政编码?
邮政编码以 7 开头,例如 76000 是索引 0
中的邮政编码
MedicalCenters["Postcode"][0]
Location(75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0))
示例数据
Venue Venue Latitude Venue Longitude Venue Category Address
0 Lab. Corregidora 20.595621 -100.392677 Medical Center Location(75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0))
我尝试使用正则表达式,但出现错误
# get zipcode from full address
import re
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b', expand=False)
错误
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-185-84c21a29d484> in <module>
1 # get zipcode from full address
2 import re
----> 3 MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b', expand=False)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in wrapper(self, *args, **kwargs)
1950 )
1951 raise TypeError(msg)
-> 1952 return func(self, *args, **kwargs)
1953
1954 wrapper.__name__ = func_name
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in extract(self, pat, flags, expand)
3037 @forbid_nonstring_types(["bytes"])
3038 def extract(self, pat, flags=0, expand=True):
-> 3039 return str_extract(self, pat, flags=flags, expand=expand)
3040
3041 @copy(str_extractall)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in str_extract(arr, pat, flags, expand)
1010 return _str_extract_frame(arr._orig, pat, flags=flags)
1011 else:
-> 1012 result, name = _str_extract_noexpand(arr._parent, pat, flags=flags)
1013 return arr._wrap_result(result, name=name, expand=expand)
1014
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _str_extract_noexpand(arr, pat, flags)
871
872 regex = re.compile(pat, flags=flags)
--> 873 groups_or_na = _groups_or_na_fun(regex)
874
875 if regex.groups == 1:
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _groups_or_na_fun(regex)
835 """Used in both extract_noexpand and extract_frame"""
836 if regex.groups == 0:
--> 837 raise ValueError("pattern contains no capture groups")
838 empty_row = [np.nan] * regex.groups
839
ValueError: pattern contains no capture groups
time: 39.5 ms
您可以尝试先拆分字符串,这样匹配邮政编码会更容易:
address = '75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0'
matches = list(filter(lambda x: x.startswith('7') and len(x) == 5, address.split(', '))) # ['76000']
因此您可以通过以下方式填充您的 DataFrame:
df['postcode'] = df['address'].apply(lambda address: list(filter(lambda x: x.startswith('7') and len(x) == 5, address.split(', ')))[0])
您需要添加括号才能使其成为一个组
MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")
地址数据是一个对象,这就是正则表达式不起作用的原因
MedicalCenters.dtypes
Venue object
Venue Latitude float64
Venue Longitude float64
Venue Category object
Health System object
geom object
Address object
Postcode object
dtype: object
time: 6.41 ms
将对象转换为字符串后:
MedicalCenters['Address'] = MedicalCenters['Address'].astype('str')
感谢 glam
,我能够应用修改后的正则表达式
# get zipcode from full address
import re
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")
我有一个列中包含完整地址的数据框,我需要创建一个单独的列,其中仅包含同一数据框中以 7 开头的 5 位邮政编码。部分地址可能为空或找不到邮政编码。
如何拆分列以仅获取邮政编码? 邮政编码以 7 开头,例如 76000 是索引 0
中的邮政编码MedicalCenters["Postcode"][0]
Location(75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0))
示例数据
Venue Venue Latitude Venue Longitude Venue Category Address
0 Lab. Corregidora 20.595621 -100.392677 Medical Center Location(75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0))
我尝试使用正则表达式,但出现错误
# get zipcode from full address
import re
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b', expand=False)
错误
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-185-84c21a29d484> in <module>
1 # get zipcode from full address
2 import re
----> 3 MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b', expand=False)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in wrapper(self, *args, **kwargs)
1950 )
1951 raise TypeError(msg)
-> 1952 return func(self, *args, **kwargs)
1953
1954 wrapper.__name__ = func_name
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in extract(self, pat, flags, expand)
3037 @forbid_nonstring_types(["bytes"])
3038 def extract(self, pat, flags=0, expand=True):
-> 3039 return str_extract(self, pat, flags=flags, expand=expand)
3040
3041 @copy(str_extractall)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in str_extract(arr, pat, flags, expand)
1010 return _str_extract_frame(arr._orig, pat, flags=flags)
1011 else:
-> 1012 result, name = _str_extract_noexpand(arr._parent, pat, flags=flags)
1013 return arr._wrap_result(result, name=name, expand=expand)
1014
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _str_extract_noexpand(arr, pat, flags)
871
872 regex = re.compile(pat, flags=flags)
--> 873 groups_or_na = _groups_or_na_fun(regex)
874
875 if regex.groups == 1:
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _groups_or_na_fun(regex)
835 """Used in both extract_noexpand and extract_frame"""
836 if regex.groups == 0:
--> 837 raise ValueError("pattern contains no capture groups")
838 empty_row = [np.nan] * regex.groups
839
ValueError: pattern contains no capture groups
time: 39.5 ms
您可以尝试先拆分字符串,这样匹配邮政编码会更容易:
address = '75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0'
matches = list(filter(lambda x: x.startswith('7') and len(x) == 5, address.split(', '))) # ['76000']
因此您可以通过以下方式填充您的 DataFrame:
df['postcode'] = df['address'].apply(lambda address: list(filter(lambda x: x.startswith('7') and len(x) == 5, address.split(', ')))[0])
您需要添加括号才能使其成为一个组
MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")
地址数据是一个对象,这就是正则表达式不起作用的原因
MedicalCenters.dtypes
Venue object
Venue Latitude float64
Venue Longitude float64
Venue Category object
Health System object
geom object
Address object
Postcode object
dtype: object
time: 6.41 ms
将对象转换为字符串后:
MedicalCenters['Address'] = MedicalCenters['Address'].astype('str')
感谢 glam
,我能够应用修改后的正则表达式# get zipcode from full address
import re
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")