How can I replace an address so that it has only numbers and some letters, using the str.replace() function in Python?
I am trying to match left and right addresses (from separate tables) on a reference index (coClean) that I created in #Python #JupyterNotebook using the following code:
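import pandas as pd

df1 = pd.read_csv("/content/Addmatchdf1.csv")
df2 = pd.read_csv("/content/Addmatchdf2.csv")

def cleanAddress(series):
    # Lowercase, then drop letters, whitespace and commas so only digits remain.
    # regex=True is explicit because pandas >= 2.0 treats the pattern literally by default.
    return series.str.lower().str.replace(r"[a-z\s,]", "", regex=True)

df1["coClean"] = cleanAddress(df1["Address"])
df2["coClean"] = cleanAddress(df2["Address"])
df = pd.merge(df1, df2, on=["coClean"], how="inner")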
This generates coClean as the reference index.
Address_x | coClean | Address_y |
---|---|---|
7 Pindara Bvd LANGWARRIN VIC 3910 | 73910 | 7 Pindara Blv, Langwarrin, VIC 3910 |
2a Manor St BACCHUS MARSH VIC 3340 | 23340 | 2a Manor Street, BACCHUS MARSH, VIC 3340 |
38 Sommersby Rd POINT COOK VIC 3030 | 383030 | 38 Sommersby Road, Point Cook, VIC 3030 |
17 Moira Avenue, Carnegie, Vic 3163 | 173163 | 17 Moira Avenue, Carnegie, Vic 3163 |
17 Moira Avenue, Carnegie, Vic 3163 | 173163 | 17 Newman Avenue, Carnegie, VIC 3163 |
17 Moira Avenue, Carnegie, Vic 3163 | 173163 | 17 Maroona Rd, Carnegie VIC 3163 |
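The problem I am facing is that some addresses within the same postcode share the same house number. Because their reference index is then identical, the join matches rows that are not the same address.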
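To make the collision concrete, here is a minimal sketch that runs the digits-only cleaning on the three Carnegie addresses from the table. It assumes pandas is installed; `regex=True` is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default.

```python
import pandas as pd

# The three right-hand addresses that share house number 17 and postcode 3163.
addresses = pd.Series([
    "17 Moira Avenue, Carnegie, Vic 3163",
    "17 Newman Avenue, Carnegie, VIC 3163",
    "17 Maroona Rd, Carnegie VIC 3163",
])

# Same idea as cleanAddress: lowercase, then strip letters, whitespace and commas.
digits_only = addresses.str.lower().str.replace(r"[a-z\s,]", "", regex=True)

print(digits_only.tolist())   # every entry is the same key
print(digits_only.nunique())  # number of distinct keys
```

All three streets collapse to one key, so an inner merge on it pairs the single left-hand row with every one of them.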
How can I modify this function so that the reference index contains only
a. the house numbers
b. first four letters
c. postcode
so that the new reference for "23340" (2a manor street bacchus marsh vic 3340) becomes
'2aman3340'? The returned list would then be:
coClean
7pind3910
2aman3340
38somm3030
17moir3163
17newm3163
17maro3163
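I tried modifying the function to keep all letters and digits:
def cleanAddress(series):
    return series.str.lower().str.replace(r"[^a-z\d]", "", regex=True)
But keeping all the letters does not solve the problem, because one table writes "Street" as "St" and "Road" as "Rd" while the other spells them out. So a better strategy is to rely on the house number, a few leading letters, and the postcode.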
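A quick check of why keeping every letter fails, using plain `re` on single strings instead of pandas. The helper name `all_alnum_key` is made up for this illustration.

```python
import re

def all_alnum_key(address):
    """Lowercase the address and drop everything except letters and digits."""
    return re.sub(r"[^a-z\d]", "", address.lower())

left = all_alnum_key("2a Manor St BACCHUS MARSH VIC 3340")
right = all_alnum_key("2a Manor Street, BACCHUS MARSH, VIC 3340")

print(left)           # key built from the "St" spelling
print(right)          # key built from the "Street" spelling
print(left == right)  # the two keys differ, so the rows would not join
print(left[:5], right[:5])  # the leading characters still agree
```

The keys diverge at the street-type abbreviation, while the first few characters (house number plus the start of the street name) are identical in both tables.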
Thank you for your valuable suggestions.
Update:
I replaced
def cleanAddress(series):
    return series.str.lower().str.replace(r"[a-z\s\,]","")
df1["coClean"]=cleanAddress(df1["Address"])
with
def cleanAddress(series):
    coclen=""
    number_of_letters=0
    if series:
        for i in range(len(series)):
            if series[i].isnumeric():
                coclen+=series[i]
            elif series[i].isalpha():
                number_of_letters+=1
                coclen+=series[i]
            if number_of_letters==4:
                break
        for i in range(i,len(series)):
            if series[i].isnumeric():
                coclen+=series[i]
    return coclen
This returns an error when I execute
cleanAddress(df1["Address"])
The full error is as follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-b653a19f5638> in <module>()
----> 1 df1["coClean"]=cleanAddress(df1["Address"])
1 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __nonzero__(self)
1328 def __nonzero__(self):
1329 raise ValueError(
-> 1330 f"The truth value of a {type(self).__name__} is ambiguous. "
1331 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
1332 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
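The ValueError comes from the `if series:` line: Python asks the whole Series for one truth value, and pandas refuses to guess. The body of the function is written for one string, yet it receives a whole column. One possible way out, sketched here under that reading, is to keep a per-string function and let `Series.apply` call it once per address. The name `clean_one` is hypothetical and not part of the original post.

```python
import pandas as pd

def clean_one(address):
    """Build the key for ONE address string (hypothetical helper)."""
    key = ""
    letters = 0
    rest_start = len(address)  # where the digits-only scan begins
    for i, ch in enumerate(address):
        if ch.isdigit():
            key += ch
        elif ch.isalpha():
            letters += 1
            key += ch
            if letters == 4:        # stop after the fourth letter
                rest_start = i + 1
                break
    # After the fourth letter keep digits only (the postcode in the sample data).
    key += "".join(ch for ch in address[rest_start:] if ch.isdigit())
    return key.lower()

s = pd.Series([
    "7 Pindara Bvd LANGWARRIN VIC 3910",
    "2a Manor St BACCHUS MARSH VIC 3340",
])

try:
    if s:  # the same kind of check that fails in the traceback
        pass
except ValueError as err:
    print("ValueError:", err)

# apply() hands each string to clean_one, so no Series is ever truth-tested.
print(s.apply(clean_one).tolist())
```

With this shape the column assignment would read `df1["coClean"] = df1["Address"].apply(clean_one)`.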
import pandas as pd
df1 = pd.DataFrame({
    "Address_x": [
        "7 Pindara Bvd LANGWARRIN VIC 3910",
        "2a Manor St BACCHUS MARSH VIC 3340",
        "38 Sommersby Rd POINT COOK VIC 3030",
        "17 Moira Avenue, Carnegie, Vic 3163",
    ],
    "Address_y": [
        "7 Pindara Blv, Langwarrin, VIC 3910",
        "2a Manor Street, BACCHUS MARSH, VIC 3340",
        "38 Sommersby Road, Point Cook, VIC 3030",
        "17 Moira Avenue, Carnegie, Vic 3163",
    ],
})

def cleanAddress(series):
    cocleans = []
    for address in series:
        number_of_letters = 0
        coclean = ""
        # Keep digits and letters until the fourth letter has been collected.
        for i in range(len(address)):
            if address[i].isnumeric():
                coclean += address[i]
            elif address[i].isalpha():
                number_of_letters += 1
                coclean += address[i]
            if number_of_letters == 4:
                break
        # From that position on, keep digits only (the postcode).
        for i in range(i, len(address)):
            if address[i].isnumeric():
                coclean += address[i]
        cocleans.append(coclean.lower())
    return cocleans

df1["coClean"] = cleanAddress(df1["Address_x"])
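For comparison, the same key can probably be built without an explicit loop, using `str.replace` and `str.extract`. This is only a sketch under two assumptions that hold for the sample data but are not stated in the question: every address starts with the house number, and every address ends with a four-digit postcode. "First four letters" is read as four letters in total after the leading digits, which is what '2aman3340' implies. The function name `clean_address_regex` is made up here, and the two frames are rebuilt inline from the table instead of the CSV files.

```python
import pandas as pd

def clean_address_regex(series):
    """Vectorised key: leading digits + next four letters + trailing 4-digit postcode."""
    # Drop everything except lowercase letters and digits.
    compact = series.str.lower().str.replace(r"[^a-z\d]", "", regex=True)
    # Group 0: house number and four letters; group 1: postcode at the end.
    parts = compact.str.extract(r"^(\d+[a-z]{4}).*?(\d{4})$")
    return parts[0] + parts[1]  # rows that do not match the pattern become NaN

df1 = pd.DataFrame({"Address": [
    "7 Pindara Bvd LANGWARRIN VIC 3910",
    "2a Manor St BACCHUS MARSH VIC 3340",
    "38 Sommersby Rd POINT COOK VIC 3030",
    "17 Moira Avenue, Carnegie, Vic 3163",
]})
df2 = pd.DataFrame({"Address": [
    "7 Pindara Blv, Langwarrin, VIC 3910",
    "2a Manor Street, BACCHUS MARSH, VIC 3340",
    "38 Sommersby Road, Point Cook, VIC 3030",
    "17 Moira Avenue, Carnegie, Vic 3163",
    "17 Newman Avenue, Carnegie, VIC 3163",
    "17 Maroona Rd, Carnegie VIC 3163",
]})

df1["coClean"] = clean_address_regex(df1["Address"])
df2["coClean"] = clean_address_regex(df2["Address"])

# Both frames carry an "Address" column, so merge suffixes them _x and _y.
merged = pd.merge(df1, df2, on="coClean", how="inner")
print(merged[["Address_x", "coClean", "Address_y"]])
```

On this sample each left-hand address pairs with exactly one right-hand address, because the three streets at number 17 now get different keys. Addresses with a unit prefix or a missing postcode would not match the pattern and would need separate handling.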