How do I extract the zip code from an address using the pandas extract() function?
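I need to extract the postal code (and only the postal code) into a new column for further analysis. I mostly use pandas during the data-cleaning stage. I previously tried this code: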
import pandas as pd
df_participant = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/dqthon-participants.csv')
df_participant['postal_code'] = df_participant['address'].str.extract(r'([0-9]\d+)')
print(df_participant[['address', 'postal_code']].head())
But it did not work.
This is the output:
Any help is much appreciated. Thanks!
If every postal code is 5 characters long, the following may help.
Change
df_participant['postal_code'] = df_participant['address'].str.extract(r'([0-9]\d+)')
to
df_participant['postal_code'] = df_participant['address'].str[-5:]
Output on my end:
   address                                          postal_code
0  Gg. Monginsidi No. 08\nMedan, Aceh 80734         80734
1  Gg. Rajawali Timur No. 7\nPrabumulih, MA 09434   09434
2  Jalan Kebonjati No. 0\nAmbon, SS 57739           57739
3  Jl. Yos Sudarso No. 109\nLubuklinggau, SR 76156  76156
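A caveat with plain slicing: it returns whatever the last five characters are, even when they are not a postal code. The sketch below uses made-up rows modelled on the output above (the trailing-whitespace row and the row without a code are assumptions for illustration, not taken from the dataset). It strips whitespace first and keeps the slice only when it is exactly five digits.

```python
import pandas as pd

# Hypothetical sample rows; the third has trailing whitespace and the
# fourth has no postal code at all.
df = pd.DataFrame({
    "address": [
        "Gg. Monginsidi No. 08\nMedan, Aceh 80734",
        "Jalan Kebonjati No. 0\nAmbon, SS 57739",
        "Jl. Yos Sudarso No. 109\nLubuklinggau, SR 76156  ",
        "Jl. Tanpa Kode Pos\nMedan, Aceh",
    ]
})

# Strip surrounding whitespace, then take the last five characters.
tail = df["address"].str.strip().str[-5:]

# Keep the slice only where it consists of exactly five digits;
# every other row is left as a missing value.
df["postal_code"] = tail.where(tail.str.fullmatch(r"\d{5}"))

print(df["postal_code"].tolist())
```

Rows that fail the check end up as missing values, so they can be found afterwards with isna() instead of silently carrying text such as " Aceh".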
With str.extract:
df_participant['postal_code'] = df_participant['address'].str.extract(r'(\d{5})')
# Or, if the length of the postal code varies, use \d+ anchored with "$"
df_participant['postal_code'] = df_participant['address'].str.extract(r'(\d+)$')
But you don't need it here. Just take the last 5 characters of the string, since the postal code is always at the end.
df_participant['postal_code'] = df_participant['address'].str[-5:]
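To see why the pattern in the question misses, it may help to compare it with an end-anchored one. str.extract returns the first match in each string, and in these addresses the first run of two or more digits is the house number. The sketch below uses two rows copied from the output shown earlier; the `\s*` before `$` is my own addition to tolerate trailing whitespace, and the variable names are only illustrative.

```python
import pandas as pd

addresses = pd.Series([
    "Gg. Monginsidi No. 08\nMedan, Aceh 80734",
    "Jl. Yos Sudarso No. 109\nLubuklinggau, SR 76156",
])

# Pattern from the question: the FIRST run of two or more digits,
# which here is the house number rather than the postal code.
first_run = addresses.str.extract(r"([0-9]\d+)", expand=False)

# End-anchored pattern: five digits followed only by optional whitespace.
at_end = addresses.str.extract(r"(\d{5})\s*$", expand=False)

print(first_run.tolist())  # ['08', '109']
print(at_end.tolist())     # ['80734', '76156']
```

Passing expand=False returns a Series rather than a one-column DataFrame, which is convenient when assigning to a single column.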
You can use the .str.findall method to find all the numbers in the address field, then take the last one as the postal code.
Here is an example:
Data:
   customer  address
0  shovon    1234 56th St, Bham, AL 35222
1  arsho     4th Ave, Dever, NY 25699
2  arshovon  1245 apt 9 69th St, Rio, FL 54444
3  rahman    this address has no number
Code:
import pandas as pd

data = {
    "customer": [
        "shovon", "arsho", "arshovon", "rahman"
    ],
    "address": [
        "1234 56th St, Bham, AL 35222",
        "4th Ave, Dever, NY 25699",
        "1245 apt 9 69th St, Rio, FL 54444",
        "this address has no number"
    ]
}
df = pd.DataFrame(data)
df['postal_code'] = df['address'].str.findall(r'([0-9]\d+)').apply(
    lambda x: x[-1] if len(x) >= 1 else '')
print(df)
Output:
   customer  address                            postal_code
0  shovon    1234 56th St, Bham, AL 35222       35222
1  arsho     4th Ave, Dever, NY 25699           25699
2  arshovon  1245 apt 9 69th St, Rio, FL 54444  54444
3  rahman    this address has no number
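Explanation:
This searches the address field for every run of two or more digits and uses the last run as the postal code. If the address contains no such run, the postal code is set to an empty string.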
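The same idea can probably be written without apply. The .str accessor also indexes into the lists that findall returns, and gives a missing value for an empty list. This is a sketch of that variant on the same sample data, not a claim about which form is faster; `\d{2,}` is used here as an equivalent spelling of "two or more digits", and `digit_runs` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["shovon", "arsho", "arshovon", "rahman"],
    "address": [
        "1234 56th St, Bham, AL 35222",
        "4th Ave, Dever, NY 25699",
        "1245 apt 9 69th St, Rio, FL 54444",
        "this address has no number",
    ],
})

# One list of digit runs per row; an empty list when nothing matches.
digit_runs = df["address"].str.findall(r"\d{2,}")

# .str[-1] takes the last element of each list; empty lists become NaN.
df["postal_code"] = digit_runs.str[-1]

print(df["postal_code"].tolist())
```

Unlike the empty string, the missing value can be detected with isna() or dropped with dropna(). Appending .fillna('') would reproduce the empty-string behaviour described above.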