PYTHON scrapy selenium WebDriverWait

Experts out there, if you don't mind, I'm seeking your help.

Recently I have been developing a web crawler in Python using scrapy and selenium, and it is breaking my heart.

I just want to ask: is it possible that, even when you use the statement

WebDriverWait(driver, 100, 0.1).until(EC.presence_of_all_elements_located((By.XPATH, xxxxx)))

to get those elements, the result still comes back empty? What's more, it comes back empty well before the 100-second timeout. Why?

By the way, it is random, meaning this can happen at any time, on any page.

Could the empty results be related to my network connection?

Could you help me look into the problem above, or give me some opinions or suggestions?

Thanks a lot!

-------------------- Supplementary note --------------------

Thanks for the reminder.

To summarize: I am using scrapy + selenium to crawl a review website, writing the username, post time, review content, and so on into an .xlsx file, and I want it to be as fast as possible while still collecting complete information.

On a page with many reviews, long review texts are collapsed, which means that nearly 20 reviews on each page have their own expand button.

So I need to use selenium to click the expand button, then use the driver to grab the full review. Common sense says that after clicking the expand button it takes a moment for the content to load, and I assume the time needed depends on network speed. Using WebDriverWait here therefore seemed like a sensible choice. In my experience the default parameters timeout=10, poll_frequency=0.5 seemed too slow and error-prone, so I considered using timeout=100, poll_frequency=0.1.
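For reference, WebDriverWait(driver, timeout, poll_frequency) is essentially a polling loop: it calls the condition every poll_frequency seconds and returns the first truthy value, raising TimeoutException only if the timeout elapses first. A minimal pure-Python sketch of that behavior (not selenium's real implementation, just the idea):

```python
import time

class TimeoutException(Exception):
    """Stand-in for selenium.common.exceptions.TimeoutException."""

def poll_until(condition, timeout=10, poll_frequency=0.5):
    # Mimics WebDriverWait(driver, timeout, poll_frequency).until(condition):
    # return the FIRST truthy result, even if the page is still rendering.
    end = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() > end:
            raise TimeoutException("condition not met within timeout")
        time.sleep(poll_frequency)

# presence_of_all_elements_located is truthy as soon as matching elements
# exist in the DOM -- their .text may still be empty at that moment.
attempts = iter([[], [], ["<element whose text is still empty>"]])
result = poll_until(lambda: next(attempts), timeout=1, poll_frequency=0.01)
```

This is also why the wait can finish long before the 100-second timeout yet still yield empty text: presence in the DOM is satisfied before the text finishes rendering.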


The problem is that every time I run the project from the command line with scrapy crawl spider, a few of the crawled reviews always come back empty, and which ones are empty differs from run to run. I have considered using time.sleep() to force a pause, but doing that on every page would cost a lot of time, even though it is admittedly a more reliable way to get complete information. Besides, in my opinion it looks inelegant and a bit clumsy.
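One commonly suggested alternative to time.sleep() is a custom wait condition that is satisfied only when the located elements actually have non-empty text, not merely when they are present in the DOM. A minimal sketch of the idea; the FakeElement/FakeDriver classes below are stand-ins for selenium objects, used only so the condition can be demonstrated without a browser:

```python
class NonEmptyText:
    """Wait condition: truthy only when every matching element has text.
    With a real driver you would pass an instance of this class to
    WebDriverWait(driver, timeout, poll_frequency).until(...)."""
    def __init__(self, locator):
        self.locator = locator

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        if elements and all(el.text.strip() for el in elements):
            return elements
        return False

# --- stand-ins for selenium objects, for demonstration only ---
class FakeElement:
    def __init__(self, text):
        self.text = text

class FakeDriver:
    def __init__(self, elements):
        self._elements = elements
    def find_elements(self, by, value):
        return self._elements

condition = NonEmptyText(("xpath", "//div[@class='review-content clearfix']/p"))
still_loading = FakeDriver([FakeElement("")])      # present, but text not rendered
loaded = FakeDriver([FakeElement("full review")])  # text has rendered
```

The wait then keeps polling through the "present but empty" phase instead of returning early, at the cost of timing out on reviews that legitimately have no text.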

Have I stated my question clearly?

-------------------- Adding something --------------------

What I mean by "I got somewhere empty" is shown in the image below.

-------------------- Adding my code -------------------- 2022/5/18

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')

users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
full_content, words = [], []
unfolds = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//a[@class='unfold']")))

# Here's how I think about and design my loop body.
# I click the expand button, then grab the text, then fold it back, then move on to the next one.
for i in range(len(unfolds)):
    unfolds[i].click()
    time.sleep(1)
    # After the JavaScript runs, `div[@class='review-content clearfix']` appears,
    # and some of the full review content is put inside a `<p></p>` tag
    find_full_content_p = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//div[@class='review-content clearfix']/p")))
    full_content_p = [j.text for j in find_full_content_p]
    # and some of it is put in the `div[@class='review-content clearfix']` itself.
    find_full_content_div = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//div[@class='review-content clearfix']")))
    full_content_div = [j.text for j in find_full_content_div]
    
    # then merge the two lists
    full_content_p.extend(full_content_div)
    full_content.append("".join(full_content_p))
    words.append(len("".join(full_content_p)))
    time.sleep(1)
    
    # then put it away
    WebDriverWait(driver,100,0.1).until(EC.element_to_be_clickable((By.XPATH,"//a[@class='fold']"))).click()
driver.close()
pd.DataFrame({"users":users, "dates":dates, "full_content":full_content, "words":words})

AND here is the code from an expert I truly respect, named sound wave (slightly modified; the core code is unchanged):

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
# from selenium.webdriver.chrome.service import Service
    
driver = webdriver.Chrome()

driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')

users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
reviews, words = [], []
for review in driver.find_elements(By.CSS_SELECTOR, 'div.review-short'):
    show_more = review.find_elements(By.CSS_SELECTOR, 'a.unfold')
    if show_more:
        # scroll to the show more button, needed to avoid ElementClickInterceptedException
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', show_more[0])
        show_more[0].click()
        review = review.find_element(By.XPATH, 'following-sibling::div')
        while review.get_attribute('class') == 'hidden':
            time.sleep(0.2)
        review = review.find_element(By.CSS_SELECTOR, 'div.review-content')
    reviews.append(review.text)
    words.append(len(review.text))
    print('done',len(reviews),end='\r')
pd.DataFrame({"users":users, "dates":dates, "reviews":reviews, "words":words})

Adding the code for the douban site. To export the scraped data to csv, see the pandas code in the older section below.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
    
driver = webdriver.Chrome(service=Service('...'))

driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')

users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
reviews = []
for review in driver.find_elements(By.CSS_SELECTOR, 'div.review-short'):
    show_more = review.find_elements(By.CSS_SELECTOR, 'a.unfold')
    if show_more:
        # scroll to the show more button, needed to avoid ElementClickInterceptedException
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', show_more[0])
        show_more[0].click()
        review = review.find_element(By.XPATH, 'following-sibling::div')
        while review.get_attribute('class') == 'hidden':
            time.sleep(0.2)
        review = review.find_element(By.CSS_SELECTOR, 'div.review-content')
    reviews.append(review.text)
    print('done',len(reviews),end='\r')

For the website you mentioned (imdb.com), there is no need to click the "Show more" buttons to scrape the hidden content, because the text is already loaded in the HTML code; it is simply not displayed on the page. So you can scrape all the reviews at once. The code below stores the users, dates, and reviews in separate lists and finally saves the data to a .csv file.
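Since the full text is already present in the HTML and only hidden with CSS, even a plain HTML parser can recover it without a browser. The snippet below demonstrates this on an inlined fragment; the markup is a simplified stand-in for a hidden `div.text` review, not IMDb's actual page structure:

```python
from html.parser import HTMLParser

class HiddenTextExtractor(HTMLParser):
    """Collect text inside <div class="text"> even when the div is
    hidden with CSS -- hiding does not remove the text from the HTML."""
    def __init__(self):
        super().__init__()
        self._depth = 0   # nesting depth inside a div.text element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "text":
            self._depth = 1
            self.texts.append("")
        elif self._depth:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.texts[-1] += data

# Simplified stand-in: a review hidden via inline CSS
html = '<div class="text" style="display:none">Wow! Great movie.</div>'
parser = HiddenTextExtractor()
parser.feed(html)
```

This is the same reason `get_attribute('innerText')` works in the selenium code below: the content is in the document whether or not it is rendered.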

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
  
driver = webdriver.Chrome(service=Service(chromedriver_path))

driver.get('https://www.imdb.com/title/tt1683526/reviews')
# sets a maximum waiting time for .find_element() and similar commands
driver.implicitly_wait(10)

reviews = [el.get_attribute('innerText') for el in driver.find_elements(By.CSS_SELECTOR, 'div.text')]
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span.display-name-link')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span.review-date')]

# store data in a csv file
import pandas as pd
df = pd.DataFrame(list(zip(users,dates,reviews)), columns=['user','date','review'])
df.to_csv(r'C:\Users\your_name\Desktop\data.csv', index=False)

To print a single review, you can do:

i = 0
print(f'User: {users[i]}\nDate: {dates[i]}\n{reviews[i]}')

The output (truncated) is:

User: dschmeding
Date: 26 February 2012
Wow! I was not expecting this movie to be this engaging. Its one of those films...