Using beautifulsoup and selenium to scrape multipage website returns list of empty strings
I want to iteratively scrape text from a website. Every page of the site has the same HTML structure. I use selenium to navigate to the next page after appending the following strings each time: text_i_want1, text_i_wantA, text_i_wantB, text_i_wantC.
[<div class="col-12">
<a href="/url" target="_blank" title="ad i">
text_i_want1
</a>
</div>,
<div class="col-12">
<div class="row">
<div>
date: text_i_wantA
</div>
</div>
<div class="row">
<div>
source: text_i_wantB
</div>
</div>
<div class="row">
<div>
number: text_i_wantC
<span class="processlink">
<a href="url" title="text_i_dont_want">
text_i_dont_want
</a>
</span>
</div>
</div>
</div>,
<div class="col-12">
<a href="/url" target="_blank" title="ad i">
text_i_want2
</a>
</div>,
<div class="col-12">
<div class="row">
<div>
date: text_i_wantAA
</div>
</div>
<div class="row">
<div>
source: text_i_wantBB
</div>
</div>
<div class="row">
<div>
number: text_i_wantCC
<span class="processlink">
<a href="/url" title="text_i_dont_want">
text_i_dont_want
</a>
</span>
</div>
</div>
</div>,
<div class="col-12">
<a href="/url" target="_blank" title="ad i">
text_i_want3
</a>
</div>,
<div class="col-12">
<div class="row">
<div>
date: text_i_wantAAA
</div>
</div>
<div class="row">
<div>
source: text_i_wantBBB
</div>
</div>
<div class="row">
<div>
number: text_i_wantCCC
<span class="processlink">
<a href="/url" title="text_i_dont_want">
text_i_dont_want
</a>
</span>
</div>
</div>
</div>,
<div class="col-12">
.
.
.
.
</div>]
Because text_i_want1 and text_i_wantA, text_i_wantB, text_i_wantC are not in the same div, I grab all <div class="col-12"> elements with beautifulsoup but slice the result with [1::2], so that I only iterate over every second <div class="col-12"> to get text_i_wantA, text_i_wantB, text_i_wantC. For readability, I have only included three of the otherwise structurally identical 20 <div class="col-12"> per page above.
title, date, name, number = [], [], [], []

while True:
    soup = bs(driver.page_source, 'html5lib')
    for div in soup.find_all('a', attrs={'title': 'ad i'}):
        titl = div.get_text(strip=True)
        title.append(titl)
    else:
        break
    for col in soup.find_all('div', attrs={'class': 'col-12'})[1::2]:
        row = []
        for entry in col.select('div.row div'):
            target = entry.find_all(text=True, recursive=False)
            row.append(target[0].strip())
        name.append(row[0])
        date.append(row[1])
        number.append(row[2])
    next_btn = driver.find_elements_by_css_selector(".page-next button")
    if next_btn:
        actions = ActionChains(driver)
        actions.move_to_element(next_btn[0]).click().perform()
        time.sleep(4)
    else:
        break
driver.close()
Expected output:
title = ["text_i_want1", "text_i_want2", ...]
date = ["text_i_wantA", "text_i_wantAA", ...]
name = ["text_i_wantB", "text_i_wantBB", ...]
number = ["text_i_wantC", "text_i_wantCC", ...]
PROBLEM: actual output
title = ["text_i_want1", "text_i_want2", ...]
date = ['text_i_wantA', 'text_i_wantAA', ...]
name = ['', '', '', '', '', '', '', '', '', '']
number = ['', '', '', '', '', '', '', '', '', '']
Why are name and number empty even though the HTML contains the character values? Is it a problem with the CSS selector or with the loop itself?
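To narrow this down, a minimal, self-contained check against the sample markup above shows what entry.find_all(text=True, recursive=False) actually returns for each matched div. If the live page nests one more <div> around a row's text than the sample shows (that extra nesting is an assumption, not confirmed), the wrapper's only direct text nodes are whitespace, and target[0].strip() becomes '' — which would produce exactly the empty strings seen here:

from bs4 import BeautifulSoup

# Detail block shaped like the sample above, with a hypothetical extra
# wrapper <div> around the "source" row to mimic deeper nesting.
html = """
<div class="col-12">
  <div class="row">
    <div>
      date: text_i_wantA
    </div>
  </div>
  <div class="row">
    <div>
      <div>
        source: text_i_wantB
      </div>
    </div>
  </div>
</div>
"""

col = BeautifulSoup(html, 'html.parser').find('div', attrs={'class': 'col-12'})
for entry in col.select('div.row div'):
    # Direct (non-recursive) text nodes only: a pure wrapper <div>
    # contributes nothing but whitespace, which .strip() turns into ''.
    target = entry.find_all(text=True, recursive=False)
    print(repr(target[0].strip()))

# Prints 'date: text_i_wantA', then '' for the wrapper div,
# then 'source: text_i_wantB'.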
UPDATED QUESTION: putting it together
DRIVER_PATH = 'chromedriver.exe'
options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
prefs = {"profile.default_content_settings.popups": 0,
         "download.default_directory": r"C:\Users\aaa",
         "directory_upgrade": True,
         "plugins.always_open_pdf_externally": True}
options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
driver.get('https://parldok.thueringen.de/ParlDok/formalkriterien')
driver.maximize_window()

try:
    selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(
        EC.visibility_of_element_located((By.ID, 'LegislaturperiodenList-button')))
    driver.execute_script(
        "document.getElementById('LegislaturperiodenList').style.display='inline-block';")
    element = selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(
        EC.visibility_of_element_located((By.ID, 'LegislaturperiodenList')))
    selenium.webdriver.support.ui.Select(element).select_by_value('7')
except Exception as ex:
    print(ex)

try:
    selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(
        EC.visibility_of_element_located((By.ID, 'LegislaturperiodenList-button')))
    driver.execute_script(
        "document.getElementById('DokumententypId').style.display='inline-block';")
    element = selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(
        EC.visibility_of_element_located((By.ID, 'DokumententypId')))
    selenium.webdriver.support.ui.Select(element).select_by_value('10')
except Exception as ex:
    print(ex)

driver.find_element_by_css_selector('button[class="btn btn-primary"][type="submit"]').click()
This is how I set up selenium so that it can navigate to the next page. Could you help me put the pieces together? I don't know how to combine your approach with selenium.
UPDATED ANSWER:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from math import ceil

allin = []


def parser(soup):
    # For each search result: the second <a> holds the title; the second
    # .col-12 holds date, source and number, with the trailing process
    # link dropped via [:-1].
    goal = (
        (
            x.select('a')[1].get_text(strip=True),
            *list(x.select('.col-12')[1].stripped_strings)[:-1]
        )
        for x in soup.select('.row.tlt_search_result'))
    allin.append(pd.DataFrame(goal))


def main(url):
    with requests.Session() as req:
        data = {
            "LegislaturPeriodenNummer": "7",
            "UrheberPersonenId": "",
            "UrheberSonstigeId": "",
            "DokumententypId": "10",
            "BeratungsstandId": "",
            "Datum": "",
            "DatumVon": "",
            "DatumBis": ""
        }
        r = req.post(url, data=data)
        soup = BeautifulSoup(r.text, 'lxml')
        print("Extracting Page# 1")
        parser(soup)
        try:
            # Total hit count divided by 10 results per page gives the
            # number of pages to walk.
            nextpage = int(soup.select_one(
                '.pd_resultcount').contents[0].split()[-1]) / 10
            for page in range(2, ceil(nextpage) + 1):
                print(f"Extracting Page# {page}")
                r = req.get(f"{url}/{page}")
                soup = BeautifulSoup(r.text, 'lxml')
                parser(soup)
        except AttributeError:
            print('No More Result Found!')


if __name__ == "__main__":
    main('https://parldok.thueringer-landtag.de/ParlDok/formalkriterien')
    final = pd.concat(allin, ignore_index=True)
    print(final)
    final.to_csv('data.csv', index=False)
Output:
0 ... 3
0 GRW-Fördermittelanträge eines Fertigteil-Herst... ... Dokumentnummer: 7/2303
1 Vertretung der Menschen mit Behinderungen in T... ... Dokumentnummer: 7/2307
2 Rassistische und rechtsextremistische Aktivitä... ... Dokumentnummer: 7/2306
3 Antisemitische Überfälle, Leugnung des Holocau... ... Dokumentnummer: 7/2302
4 Finanzierung von Kindertagesstätten in Thüring... ... Dokumentnummer: 7/2301
... ... ... ...
2299 NaturFreunde Thüringen e.V. - Teil I ... Dokumentnummer: 7/6
2300 Aktuelle Sicherheitslage für Thüringer Kunst- ... ... Dokumentnummer: 7/5
2301 Stand der Planungen zur Ortsumgehung der Stadt... ... Dokumentnummer: 7/3
2302 Übergangsbestimmungen zur Neuordnung der Organ... ... Dokumentnummer: 7/2
2303 Baustellen entlang der Autobahn 71 zwischen de... ... Dokumentnummer: 7/1
[2304 rows x 4 columns]
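The pagination above hinges on one line: the total hit count is read from .pd_resultcount and divided by the page size of 10, and the remaining pages are fetched as {url}/2, {url}/3, and so on. With the 2304 results shown above, that works out to ceil(2304 / 10) = 231 pages. In miniature (URL pattern taken from the answer's code):

from math import ceil

total_results = 2304   # hit count parsed from '.pd_resultcount' above
page_size = 10         # results per page on the site
last_page = ceil(total_results / page_size)
print(last_page)       # 231

base = 'https://parldok.thueringer-landtag.de/ParlDok/formalkriterien'
# Page 1 comes from the POST; pages 2..231 are plain GETs.
urls = [f"{base}/{page}" for page in range(2, last_page + 1)]
print(len(urls))       # 230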
For comparison, here is the single-page version (no pagination):

import requests
from bs4 import BeautifulSoup
import pandas as pd


def main(url):
    data = {
        "LegislaturPeriodenNummer": "7",
        "UrheberPersonenId": "",
        "UrheberSonstigeId": "",
        "DokumententypId": "10",
        "BeratungsstandId": "",
        "Datum": "",
        "DatumVon": "",
        "DatumBis": ""
    }
    r = requests.post(url, data=data)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = (
        (
            x.select('a')[1].get_text(strip=True),
            *list(x.select('.col-12')[1].stripped_strings)[:-1]
        )
        for x in soup.select('.row.tlt_search_result'))
    df = pd.DataFrame(goal)
    print(df)


main('https://parldok.thueringer-landtag.de/ParlDok/formalkriterien')
Output:
0 ... 3
0 GRW-Fördermittelanträge eines Fertigteil-Herst... ... Dokumentnummer: 7/2303
1 Vertretung der Menschen mit Behinderungen in T... ... Dokumentnummer: 7/2307
2 Rassistische und rechtsextremistische Aktivitä... ... Dokumentnummer: 7/2306
3 Antisemitische Überfälle, Leugnung des Holocau... ... Dokumentnummer: 7/2302
4 Finanzierung von Kindertagesstätten in Thüring... ... Dokumentnummer: 7/2301
5 Ausstattung der unteren Naturschutzbehörden ... Dokumentnummer: 7/2300
6 Antifa-Szene, insbesondere das Arnstädter "Akt... ... Dokumentnummer: 7/2291
7 Finanzierung der Beschaffung von Ausrüstung, A... ... Dokumentnummer: 7/2309
8 Statistik der Kfz-Diebstähle ... Dokumentnummer: 7/2308
9 Unterstützung des Freistaats Thüringen für Sta... ... Dokumentnummer: 7/2299
[10 rows x 4 columns]
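Finally, on the "putting it together" part of the updated question: if the selenium session is still needed (for the form interaction, for example), the same parser can be fed from driver.page_source instead of requests. A sketch, assuming the .row.tlt_search_result and .page-next selectors used in this thread match the live page:

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains

allin = []


def parser(soup):
    # Same extraction as in the answer above.
    goal = (
        (
            x.select('a')[1].get_text(strip=True),
            *list(x.select('.col-12')[1].stripped_strings)[:-1]
        )
        for x in soup.select('.row.tlt_search_result'))
    allin.append(pd.DataFrame(goal))


# `driver` is the webdriver instance from the updated question, already
# showing the first result page after the form was submitted.
while True:
    parser(BeautifulSoup(driver.page_source, 'lxml'))
    next_btn = driver.find_elements_by_css_selector(".page-next button")
    if not next_btn:
        break
    ActionChains(driver).move_to_element(next_btn[0]).click().perform()
    time.sleep(4)  # crude fixed wait, as in the question; an explicit wait is sturdier

final = pd.concat(allin, ignore_index=True)
print(final)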