How to extract zip code from address using pandas function extract()?

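I need to extract the zip code (and only the zip code) into a new column for further analysis. I mostly use pandas during the data cleaning phase. I previously tried this code: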

import pandas as pd
df_participant = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/dqthon-participants.csv')

df_participant['postal_code'] = df_participant['address'].str.extract(r'([0-9]\d+)')

print (df_participant[['address','postal_code']].head())

But it did not work.

This is the output:

Any help is greatly appreciated! Thanks.

If all postal codes are 5 characters long, the following may help.

Change

df_participant['postal_code'] = df_participant['address'].str.extract(r'([0-9]\d+)')

to

df_participant['postal_code'] = df_participant['address'].str[-5:]

On my end, I get this output:

                                             address postal_code
0           Gg. Monginsidi No. 08\nMedan, Aceh 80734       80734
1     Gg. Rajawali Timur No. 7\nPrabumulih, MA 09434       09434
2             Jalan Kebonjati No. 0\nAmbon, SS 57739       57739
3    Jl. Yos Sudarso No. 109\nLubuklinggau, SR 76156       76156
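The slice assumes that every address ends in exactly five digits with nothing after them. A minimal sketch of a slightly more defensive variant follows, assuming you would rather get NaN than a wrong value when that assumption fails. The sample rows are copied from the output above, except that a trailing space was added to the last one purely for illustration.

```python
import pandas as pd

# Sample addresses from the output above; the trailing space on the
# last row is an illustrative assumption, not part of the real data.
df = pd.DataFrame({
    "address": [
        "Gg. Monginsidi No. 08\nMedan, Aceh 80734",
        "Gg. Rajawali Timur No. 7\nPrabumulih, MA 09434",
        "Jalan Kebonjati No. 0\nAmbon, SS 57739 ",
    ]
})

# Strip surrounding whitespace, then take the last five characters.
tail = df["address"].str.strip().str[-5:]

# Keep the slice only where it consists of exactly five digits;
# rows that fail the check become NaN instead of a bogus code.
is_code = tail.str.fullmatch(r"\d{5}", na=False)
df["postal_code"] = tail.where(is_code)

print(df["postal_code"].tolist())
```

Because the result stays a string column, leading zeros such as the one in 09434 are preserved.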

Use str.extract:

df_participant['postal_code'] = df_participant['address'].str.extract(r'(\d{5})')

# Or, if the length of the postal code varies, use \d+ anchored with "$" to match the digits at the end of the string

df_participant['postal_code'] = df_participant['address'].str.extract(r'(\d+)$')

But you don't need it here. Just take the last 5 characters of the string, because the postal code is always at the end.

df_participant['postal_code'] = df_participant['address'].str[-5:]
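To see why the pattern in the question picks up the wrong number, it may help to compare the patterns side by side. str.extract returns only the first match in each string, so an unanchored pattern can land on the house number. The sketch below is illustrative only: the third address is hypothetical, made up to show a five-digit house number, and the `\s*$` suffix is an added assumption to tolerate trailing whitespace.

```python
import pandas as pd

addresses = pd.Series([
    "Gg. Monginsidi No. 08\nMedan, Aceh 80734",
    "Jl. Yos Sudarso No. 109\nLubuklinggau, SR 76156",
    "Jl. Contoh No. 12345\nMedan, Aceh 80734",  # hypothetical row
])

# Pattern from the question: the first run of two or more digits,
# which is the house number in all three rows.
first_run = addresses.str.extract(r"([0-9]\d+)", expand=False)

# Unanchored five-digit pattern: fine until a house number
# happens to have five digits as well.
unanchored = addresses.str.extract(r"(\d{5})", expand=False)

# Anchored pattern: five digits at the very end of the string.
anchored = addresses.str.extract(r"(\d{5})\s*$", expand=False)

print(first_run.tolist())
print(unanchored.tolist())
print(anchored.tolist())
```

Passing expand=False makes str.extract return a Series for a single capture group, which is convenient when assigning to one column.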

You can use the .str.findall method to find all the numbers in the address field, then take the last one as the postal code.

Here is an example:

Data:

   customer                            address
0    shovon       1234 56th St, Bham, AL 35222
1     arsho           4th Ave, Dever, NY 25699
2  arshovon  1245 apt 9 69th St, Rio, FL 54444
3    rahman         this address has no number

Code:

import pandas as pd

data = {
    "customer": [
        "shovon", "arsho", "arshovon", "rahman"
    ],
    "address": [
        "1234 56th St, Bham, AL 35222",
        "4th Ave, Dever, NY 25699",
        "1245 apt 9 69th St, Rio, FL 54444",
        "this address has no number"
    ]
}

df = pd.DataFrame(data)
# Collect every run of two or more digits, then keep the last run ('' if there is none)
df['postal_code'] = df['address'].str.findall(r'([0-9]\d+)').apply(
    lambda x: x[-1] if len(x) >= 1 else '')
print(df)

Output:

   customer                            address postal_code
0    shovon       1234 56th St, Bham, AL 35222       35222
1     arsho           4th Ave, Dever, NY 25699       25699
2  arshovon  1245 apt 9 69th St, Rio, FL 54444       54444
3    rahman         this address has no number            

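Explanation:

This finds every run of two or more digits in the address field and uses the last run as the postal code. Single digits, such as the 4 in "4th" or the 9 in "apt 9", do not match the pattern. If the address field contains no matching digits, the postal code is set to an empty string.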
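As a possible variation, the lambda can be replaced by indexing through the .str accessor, which also works on a column of lists and yields NaN where a list is empty. This is a sketch rather than a drop-in replacement: the pattern `\d{2,}` is intended to match the same runs as `[0-9]\d+`, and the sample rows are a subset of the data above.

```python
import pandas as pd

df = pd.DataFrame({
    "address": [
        "1234 56th St, Bham, AL 35222",
        "1245 apt 9 69th St, Rio, FL 54444",
        "this address has no number",
    ]
})

# Each element is a list of all runs of two or more digits.
matches = df["address"].str.findall(r"\d{2,}")

# .str[-1] takes the last list element and gives NaN for empty
# lists; fillna("") then mirrors the empty-string fallback above.
df["postal_code"] = matches.str[-1].fillna("")

print(df)
```

Dropping the fillna("") call keeps missing values as NaN, which may be easier to filter on later than empty strings.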
