How can I replace an address so that it has only numbers and some letters, using the str.replace() function in Python?

I am trying to match left and right addresses (from separate tables) on a reference index (coClean), which I created in #Python #JupyterNotebook using the following code:

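import pandas as pd
df1=pd.read_csv("/content/Addmatchdf1.csv")
df2=pd.read_csv("/content/Addmatchdf2.csv")

def cleanAddress(series):
    # Drop letters, whitespace and commas so that only the digits remain.
    # regex=True is required on pandas 2.0+, where str.replace matches literally by default.
    return series.str.lower().str.replace(r"[a-z\s,]","",regex=True)

df1["coClean"]=cleanAddress(df1["Address"])
# Both tables need the key column for the merge to work.
df2["coClean"]=cleanAddress(df2["Address"])
df = pd.merge(df1, df2,
              on=['coClean'],
              how='inner')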

This produces coClean as the reference index.

Address_x                            coClean  Address_y
7 Pindara Bvd LANGWARRIN VIC 3910    73910    7 Pindara Blv, Langwarrin, VIC 3910
2a Manor St BACCHUS MARSH VIC 3340   23340    2a Manor Street, BACCHUS MARSH, VIC 3340
38 Sommersby Rd POINT COOK VIC 3030  383030   38 Sommersby Road, Point Cook, VIC 3030
17 Moira Avenue, Carnegie, Vic 3163  173163   17 Moira Avenue, Carnegie, Vic 3163
17 Moira Avenue, Carnegie, Vic 3163  173163   17 Newman Avenue, Carnegie, VIC 3163
17 Moira Avenue, Carnegie, Vic 3163  173163   17 Maroona Rd, Carnegie VIC 3163

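The problem I am facing is that some addresses under the same postcode share the same house number. Because their reference index is identical, the join matches them to each other even though the streets differ.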
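
To gauge how many rows are affected, one option is to flag keys that occur more than once before merging. The following is a minimal sketch using a few addresses from the table above; the frame name right is hypothetical and stands in for either table.

```python
import pandas as pd

# A few right-hand addresses copied from the table above.
right = pd.DataFrame({"Address": [
    "17 Moira Avenue, Carnegie, Vic 3163",
    "17 Newman Avenue, Carnegie, VIC 3163",
    "17 Maroona Rd, Carnegie VIC 3163",
    "7 Pindara Blv, Langwarrin, VIC 3910",
]})

# Digits-only key, as produced by the original cleanAddress.
right["coClean"] = (
    right["Address"].str.lower().str.replace(r"[a-z\s,]", "", regex=True)
)

# keep=False marks every member of a duplicated group, not just the repeats.
collisions = right[right["coClean"].duplicated(keep=False)]
print(collisions)
```

Any row that shows up in collisions will be matched against every other row with the same key during an inner merge.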

How can I modify this function so that the reference index contains only

a. the house numbers
b. first four letters
c. postcode

so that the new reference for "23340" (2a manor street bacchus marsh vic 3340) becomes '2aman3340'? The returned list would then be as follows:

coClean
7pind3910
2aman3340
38somm3030
17moir3163
17newm3163
17maro3163

I tried modifying the function to keep all letters and digits:

def cleanAddress(series):
    return series.str.lower().str.replace(r"[^a-z\d]","",regex=True)

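But including all the letters does not solve the problem, because the tables spell things differently: one writes "street" where the other writes "st.", and "road" where the other writes "rd". A better strategy is therefore to rely on the house number, a few initial letters, and the postcode.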

Thank you for your valuable suggestions.

Update: I replaced

def cleanAddress(series):
    return series.str.lower().str.replace(r"[a-z\s,]","",regex=True)
df1["coClean"]=cleanAddress(df1["Address"])

with

def cleanAddress(series):
    coclen=""
    number_of_letters=0
    if series:
        for i in range(len(series)):
            if series[i].isnumeric():
                coclen+=series[i]
            elif series[i].isalpha():
                number_of_letters+=1
                coclen+=series[i]
                if number_of_letters==4:
                    break
        for i in range(i,len(series)):
            if series[i].isnumeric():
                coclen+=series[i]
    return coclen

Running the following returns an error:

cleanAddress(df1["Address"])

The full error is as follows:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-b653a19f5638> in <module>()
----> 1 df1["coClean"]=cleanAddress(df1["Address"])

1 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __nonzero__(self)
   1328     def __nonzero__(self):
   1329         raise ValueError(
-> 1330             f"The truth value of a {type(self).__name__} is ambiguous. "
   1331             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1332         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
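
The traceback is consistent with the check "if series:" in the updated function: the function receives the whole column, and pandas refuses to reduce a multi-element Series to a single True or False. A minimal reproduction, assuming a two-row sample in place of df1["Address"] (the name addresses is illustrative):

```python
import pandas as pd

# Hypothetical two-row sample standing in for df1["Address"].
addresses = pd.Series([
    "7 Pindara Bvd LANGWARRIN VIC 3910",
    "2a Manor St BACCHUS MARSH VIC 3340",
])

message = ""
try:
    if addresses:  # asks for the truth value of the whole Series
        pass
except ValueError as exc:
    message = str(exc)

print(message)

# An explicit emptiness check is unambiguous.
has_rows = not addresses.empty
print(has_rows)
```

Even with the check fixed, series[i] would be a whole address string rather than a single character, so the function needs to iterate over the addresses first.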

import pandas as pd
df1 = pd.DataFrame({
    "Address_x":["7 Pindara Bvd LANGWARRIN VIC 3910",
                 "2a Manor St BACCHUS MARSH VIC 3340",
                 "38 Sommersby Rd POINT COOK VIC 3030",
                 "17 Moira Avenue, Carnegie, Vic 3163"],
    "Address_y":["7 Pindara Blv, Langwarrin, VIC 3910",
                 "2a Manor Street, BACCHUS MARSH, VIC 3340",
                 "38 Sommersby Road, Point Cook, VIC 3030",
                 "17 Moira Avenue, Carnegie, Vic 3163"]})
def cleanAddress(series):
    # Build one key per address: every digit plus the first four letters, in order.
    cocleans=[]
    for address in series:
        number_of_letters=0
        coclean=""
        for ch in address:
            if ch.isnumeric():
                coclean+=ch
            elif ch.isalpha() and number_of_letters<4:
                # Letters after the fourth are skipped; digits are still collected.
                number_of_letters+=1
                coclean+=ch
        cocleans.append(coclean.lower())
    return cocleans
df1["coClean"]=cleanAddress(df1["Address_x"])
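
An alternative sketch keeps the logic in a function that handles one string and applies it with Series.map, then merges on the new key. It assumes that each address starts with the house number and ends with the postcode, with no other digits in between (a unit number or a numbered street would end up in the key). The names clean_one, left and right are illustrative and not from the original post.

```python
import pandas as pd

def clean_one(address: str) -> str:
    """Keep every digit plus the first four letters, in order, lowercased."""
    kept = []
    letters = 0
    for ch in address.lower():
        if ch.isdigit():
            kept.append(ch)
        elif ch.isalpha() and letters < 4:
            kept.append(ch)
            letters += 1
    return "".join(kept)

left = pd.DataFrame({"Address": [
    "7 Pindara Bvd LANGWARRIN VIC 3910",
    "2a Manor St BACCHUS MARSH VIC 3340",
    "17 Moira Avenue, Carnegie, Vic 3163",
]})
right = pd.DataFrame({"Address": [
    "7 Pindara Blv, Langwarrin, VIC 3910",
    "2a Manor Street, BACCHUS MARSH, VIC 3340",
    "17 Newman Avenue, Carnegie, VIC 3163",
    "17 Maroona Rd, Carnegie VIC 3163",
]})

# map calls clean_one once per address string.
left["coClean"] = left["Address"].map(clean_one)
right["coClean"] = right["Address"].map(clean_one)

# Overlapping "Address" columns become Address_x and Address_y.
merged = left.merge(right, on="coClean", how="inner")
print(merged)
```

With these samples, "St" and "Street" yield the same key, while 17 Newman Avenue and 17 Maroona Rd no longer collide with 17 Moira Avenue. Two streets that share their first letters under the same number and postcode would still collide.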