BeautifulSoup: how to get all article links from this link?
I want to get all the article links from "https://www.cnnindonesia.com/search?query=covid". Here is my code:
import requests
from bs4 import BeautifulSoup as bs

links = []
base_url = requests.get("https://www.cnnindonesia.com/search?query=covid")
soup = bs(base_url.text, 'html.parser')
cont = soup.find_all('div', class_='container')
for l in cont:
    l_cont = l.find_all('div', class_='l_content')
    for bf in l_cont:
        bf_cont = bf.find_all('div', class_='box feed')
        for lm in bf_cont:
            lm_cont = lm.find('div', class_='list media_rows middle')
            for article in lm_cont.find_all('article'):
                a_cont = article.find('a', href=True)
                if url:
                    link = a['href']
                    links.append(link)
The result is as follows:
links
[]
Sorry, I don't have enough reputation to add a comment.
I think the problem is this line:
for url in lm_row_cont.find_all('a'):
where the tag should be written as 'a' rather than '<a>'.
Alternatively, after grabbing the div you could skip the nested traversal above and use a regular expression to match the relevant items, as sketched below.
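For completeness, here is a minimal sketch of that regex route (the pattern is hypothetical: it assumes each article's first anchor carries a plain double-quoted href attribute, which is brittle compared with a real parser):

import re
import requests

html = requests.get("https://www.cnnindonesia.com/search?query=covid").text
# Capture the href of the first <a> inside each <article> block.
links = re.findall(r'<article[^>]*>\s*<a[^>]+href="([^"]+)"', html)
print(links)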
Each article has this structure:
<article class="col_4">
<a href="https://www.cnnindonesia.com/...">
<span>...</span>
<h2 class="title">...</h2>
</a>
</article>
It is simpler to iterate over the article elements and then find the a element inside each one.
Try:
from bs4 import BeautifulSoup
import requests

links = []
response = requests.get("https://www.cnnindonesia.com/search?query=covid")
soup = BeautifulSoup(response.text, 'html.parser')
for article in soup.find_all('article'):
    url = article.find('a', href=True)
    if url:
        link = url['href']
        print(link)
        links.append(link)
print(links)
Output:
https://www.cnnindonesia.com/nasional/...pola-sawah-di-laut-natuna-utara
...
['https://www.cnnindonesia.com/nasional/...pola-sawah-di-laut-natuna-utara', ...
'https://www.cnnindonesia.com/gaya-hidup/...ikut-penerbangan-gravitasi-nol']
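As a side note, the same article-then-anchor traversal can be collapsed into one CSS selector; a minimal sketch, reusing the soup object from above:

# Equivalent one-liner: every <a href=...> nested inside an <article>.
links = [a['href'] for a in soup.select('article a[href]')]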
Update:
If you want to extract URLs that are added dynamically by JavaScript to the <div class="list media_rows middle"> element, you will have to use something like Selenium, which can extract the content after the full page has been rendered in a web browser.
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.cnnindonesia.com/search?query=covid'
links = []

options = webdriver.ChromeOptions()
pathToChromeDriver = "chromedriver.exe"
browser = webdriver.Chrome(executable_path=pathToChromeDriver,
                           options=options)
try:
    browser.get(url)
    browser.implicitly_wait(10)
    html = browser.page_source
    content = browser.find_element(By.CLASS_NAME, 'media_rows')
    for elt in content.find_elements(By.TAG_NAME, 'article'):
        link = elt.find_element(By.TAG_NAME, 'a')
        href = link.get_attribute('href')
        if href:
            print(href)
            links.append(href)
finally:
    browser.quit()
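One robustness note: implicitly_wait applies a timeout to every find_element call, but when the list is injected late by JavaScript it is often more reliable to wait explicitly for the container. A minimal sketch of that variant (Service is the Selenium 4 way to pass the driver path; everything else mirrors the code above):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.cnnindonesia.com/search?query=covid'
browser = webdriver.Chrome(service=Service("chromedriver.exe"),
                           options=webdriver.ChromeOptions())
try:
    browser.get(url)
    # Block until the article list container is present in the DOM (up to 10 s).
    content = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'media_rows'))
    )
    links = [a.get_attribute('href')
             for a in content.find_elements(By.TAG_NAME, 'a')]
    print(links)
finally:
    browser.quit()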