If the previous element is x, then return it
I need to build a web scraper for this page: https://www.valchov.cz/sluzby/specialni-sluzby-/

I have already figured out how to get "Vyvěšeno" and "Sejmuto" by using previous_sibling, but now I need to get all the divs (into one variable). I think some if statements would help. There are sometimes 1 and sometimes up to 3 divs above them.

A sample from my current array:
['11.\xa0veřejné zasedání zastupitelstva obce se uskuteční 21.\xa012.\xa02011 v\xa019.30\xa0v\xa0budově obecního\xa0úřadu.', 'Vyvěšeno: 13. 12. 2011', 'Sejmuto: 21. 12. 2011']
Code:
from bs4 import BeautifulSoup
import requests
import re
from csv import writer

url = "https://www.valchov.cz/sluzby/specialni-sluzby-/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
sejmuto = soup.find_all("p", string=re.compile("Sejmuto:"))

with open("listings.csv", "w", encoding="utf8") as f:
    thewriter = writer(f)
    header = ["Name", "Name bezdiakritikyamezer", "URL", "Zveřejněno", "Sejmuto"]
    thewriter.writerow(header)
    for hhh in sejmuto:
        item1 = hhh.previous_sibling.previous_sibling.text
        itemz = hhh.previous_sibling.previous_sibling.previous_sibling.previous_sibling.text
        item2 = hhh.text
        item = [itemz, item1, item2]
        print(item)
The question and expected output are not entirely clear, but assuming your goal is to grab all the links and attach the corresponding dates, I would suggest adjusting your script like this:
You do not need the extra re module; use CSS selectors instead:

soup.select('p:-soup-contains("Sejmuto")')
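As a self-contained sketch of what that selector does (the miniature HTML fragment below is made up for illustration; the :-soup-contains pseudo-class needs the soupsieve package, which is installed alongside bs4), it replaces the re.compile() approach from the question:

```python
from bs4 import BeautifulSoup

# Made-up fragment mimicking the page's structure
html = """
<p><a href="/doc1.pdf">Zápis 1</a></p>
<p>Vyvěšeno: 13. 12. 2011</p>
<p>Sejmuto: 21. 12. 2011</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <p> whose text contains "Sejmuto" -- no re module needed
hits = soup.select('p:-soup-contains("Sejmuto")')
print([p.text for p in hits])  # → ['Sejmuto: 21. 12. 2011']
```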
Select and iterate over all find_previous_siblings(); check whether each one contains an <a> and, if so, write your row to the CSV; skip the "Přílohy" label with continue, and otherwise break out of the for loop:
for ps in e.find_previous('p').find_previous_siblings():
    if ps.a:
        name = ps.a.text
        url = ps.a.get('href')
        zve = e.find_previous('p').text.split(':')[-1]
        sej = e.text.split('Sejmuto')[-1].strip('.:')
        item = [name, url, zve, sej]
        thewriter.writerow(item)
    elif 'Přílohy' in ps.text:
        continue
    else:
        break
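To make the control flow concrete, here is the same sibling walk on a made-up fragment (the listing names and URLs are invented): paragraphs wrapping an <a> are collected, the "Přílohy" label is skipped, and the first unrelated paragraph ends the block:

```python
from bs4 import BeautifulSoup

# Invented fragment: two linked listings above a Vyvěšeno/Sejmuto date pair
html = """
<p>Some unrelated heading</p>
<p>Přílohy</p>
<p><a href="/b.pdf">Listing B</a></p>
<p><a href="/a.pdf">Listing A</a></p>
<p>Vyvěšeno: 13. 12. 2011</p>
<p>Sejmuto: 21. 12. 2011</p>
"""
soup = BeautifulSoup(html, "html.parser")
e = soup.select_one('p:-soup-contains("Sejmuto")')

names = []
for ps in e.find_previous('p').find_previous_siblings():
    if ps.a:                    # paragraph wraps a link -> part of this listing
        names.append(ps.a.text)
    elif 'Přílohy' in ps.text:  # attachment label -> skip it
        continue
    else:                       # anything else ends the listing block
        break
print(names)  # → ['Listing A', 'Listing B'] (siblings come nearest-first)
```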
Note: the page uses some irregular and inconsistent spelling/punctuation (":", "." or none at all).
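A quick sketch of how the split()/strip() combination copes with that inconsistency (the sample strings are invented variants of the label):

```python
# Invented variants of the label, with ":", "." or no punctuation at all
samples = ['Sejmuto: 21. 12. 2011', 'Sejmuto. 21. 12. 2011', 'Sejmuto 21. 12. 2011']

# Cut off the label, then strip stray '.', ':' and spaces from the edges
cleaned = [s.split('Sejmuto')[-1].strip('.: ') for s in samples]
print(cleaned)  # → ['21. 12. 2011', '21. 12. 2011', '21. 12. 2011']
```

Including a space in the strip set also trims the leading blank that .strip('.:') alone would leave behind.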
Example
from bs4 import BeautifulSoup
import requests
from csv import writer

url = "https://www.valchov.cz/sluzby/specialni-sluzby-/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

# newline="" prevents the csv writer from emitting blank rows on Windows
with open("listings.csv", "w", encoding="utf8", newline="") as f:
    thewriter = writer(f)
    header = ["Name", "URL", "Zveřejněno", "Sejmuto"]
    thewriter.writerow(header)
    for e in soup.select('p:-soup-contains("Sejmuto")'):
        for ps in e.find_previous('p').find_previous_siblings():
            if ps.a:
                name = ps.a.text
                url = ps.a.get('href')
                zve = e.find_previous('p').text.split(':')[-1]
                sej = e.text.split('Sejmuto')[-1].strip('.:')
                item = [name, url, zve, sej]
                thewriter.writerow(item)
            elif 'Přílohy' in ps.text:
                continue
            else:
                break
Output
...