Iterate links from selenium into bs4 and print stripped strings

Intent:

1. Access the homepage at http://blogdobg.com.br/ using Selenium.

2. Identify the article links.

3. Feed each link into bs4 and pull out the text.

Problem: I can print all of the links, or move a single link into bs4 for parsing and printing. Every attempt to read through all of the links just ended with the same link repeated over and over.

I only started learning two days ago, so any pointers would be much appreciated.

from selenium import webdriver
from lxml import html
import requests
import re
from bs4 import BeautifulSoup

def read(html):
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    for string in soup.article.stripped_strings:
        print(repr(string))

path_to_chromedriver = '/Users/yakir/chromedriver' 
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

url = 'http://blogdobg.com.br/'
browser.get(url)

articles = browser.find_elements_by_xpath("""//*[contains(concat( " ", @class, " " ), concat( " ", "entry-title", " " ))]//a""")

#get all the links
for link in articles:
    link.get_attribute("href")

# Attempt to print the stripped strings from each link's landing page
for link in articles:
    read(link.get_attribute("href"))

## Method for getting one link to work all the way through (currently commented out)
#article1 = articles[1].get_attribute("href")
#browser.get(article1)
#read(article1)

First of all: your function read() takes an html parameter, yet you assign to an html variable directly inside the function. That makes no sense — the argument is ignored either way, because BeautifulSoup(html, "html.parser") gets its value from html = browser.page_source, not from the parameter html.
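A minimal sketch of a fixed read() — assuming you pass it the page HTML as text (for example from requests or browser.page_source at the call site) instead of shadowing the argument:

```python
from bs4 import BeautifulSoup

def read(html):
    # Parse the HTML that was actually passed in -- no reassignment
    # from browser.page_source inside the function.
    soup = BeautifulSoup(html, "html.parser")
    for string in soup.article.stripped_strings:
        print(repr(string))

# Quick check against an inline snippet instead of a live page:
read('<article><h1>Example</h1><p> some text </p></article>')
```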

Another problem: you never actually collect the links here:

for link in articles:
    link.get_attribute("href")

You should use a list and append a value on each iteration:

link_list = []
for link in articles:
    link_list.append(link.get_attribute("href"))

Then you can use your links like this:

for link in link_list:
    r = requests.get(link)
    ...
    # do whatever you want to do with response
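Putting the pieces together, a hedged end-to-end sketch — the empty link_list placeholder stands in for the hrefs collected with Selenium above, and print_article_text is a hypothetical helper name:

```python
import requests
from bs4 import BeautifulSoup

def print_article_text(html):
    """Print every stripped string found inside the page's <article> tag."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.article
    if article is None:  # not every page is guaranteed to have an <article>
        return
    for string in article.stripped_strings:
        print(repr(string))

# Placeholder: in the real script this list is filled by the
# link_list.append(link.get_attribute("href")) loop shown above.
link_list = []

for link in link_list:
    r = requests.get(link)
    print_article_text(r.text)
```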