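Can't get a script to keep a space at a certain position within some addresses
I'm trying to scrape all the file names and their associated addresses from a static webpage. The script I've created fetches them almost correctly, except that it drops a space at a certain position within some addresses. To be clearer, among other results, the script prints the following in the console: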
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221Carlisle Avenue
whereas my expected output is (note the space before Carlisle Avenue):
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221 Carlisle Avenue
Current approach:
import requests
from bs4 import BeautifulSoup

link = 'https://www.esquimalt.ca/business-development/development-tracker/rezoning-applications'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    # Each application sits in its own two-column table
    for item in soup.select("table.table_two_columns > tbody"):
        # File number: drop the "File:" label and normalize non-breaking spaces
        file = item.select_one("tr > td:has(strong:-soup-contains('File:'))").get_text(strip=True).replace("File:","").replace("\xa0"," ").strip()
        # One <p> per address in the first cell of the first row
        addr_list = [i.text for i in item.select("tr:nth-of-type(1) > td:nth-of-type(1) > p")]
        for addr in addr_list:
            print(file,addr)
Output I'm getting (truncated):
RZ000095 A-904 Admirals Road
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221Carlisle Avenue
RZ000089 512 & 522 Fraser Street
RZ000089 1212, 1216, 1220, 1222, 1224 & 1226Lyall Street
RZ000055 1072 Colville Road
RZ000056 1076 Colville Road
Output I wish to get (note the space before Carlisle Avenue and Lyall Street):
RZ000095 A-904 Admirals Road
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221 Carlisle Avenue
RZ000089 512 & 522 Fraser Street
RZ000089 1212, 1216, 1220, 1222, 1224 & 1226 Lyall Street
RZ000055 1072 Colville Road
RZ000056 1076 Colville Road
Instead of i.text, use i.get_text() with the separator= parameter:
import requests
from bs4 import BeautifulSoup

link = "https://www.esquimalt.ca/business-development/development-tracker/rezoning-applications"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("table.table_two_columns > tbody"):
        # File number: drop the "File:" label and normalize non-breaking spaces
        file = (
            item.select_one("tr > td:has(strong:-soup-contains('File:'))")
            .get_text(strip=True)
            .replace("File:", "")
            .replace("\xa0", " ")
            .strip()
        )
        # separator=" " joins the text nodes inside each <p> with a space
        addr_list = [
            i.get_text(strip=True, separator=" ")
            for i in item.select("tr:nth-of-type(1) > td:nth-of-type(1) > p")
        ]
        for addr in addr_list:
            print(file, addr)
Prints:
RZ000095 A-904 Admirals Road
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221 Carlisle Avenue
RZ000089 512 & 522 Fraser Street
RZ000089 1212, 1216, 1220, 1222, 1224 & 1226 Lyall Street
RZ000055 1072 Colville Road
RZ000056 1076 Colville Road
RZ000098 812 Craigflower Road
RZ000083 881 Craigflower Road
RZ000071 820 Dunsmuir Road
...
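For illustration, here is a minimal sketch of why the separator helps. It assumes (I haven't checked this against the page's actual markup) that the street name sits in its own text node inside the <p>, for example after a <br>. The HTML snippet below is made up for the demo and is not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the street name follows a <br>, so the <p> holds two
# separate text nodes with no whitespace between them.
html = "<p>1207, 1211 &amp; 1221<br>Carlisle Avenue</p>"

p = BeautifulSoup(html, "html.parser").p

# .text joins the text nodes with an empty string, so they run together.
joined = p.text

# separator=" " joins the text nodes with a space instead;
# strip=True trims each node before joining.
spaced = p.get_text(separator=" ", strip=True)

print(joined)  # 1207, 1211 & 1221Carlisle Avenue
print(spaced)  # 1207, 1211 & 1221 Carlisle Avenue
```

Keep in mind that the separator goes between every pair of text nodes, so text split across inline tags inside a single word would also gain a space.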