How do I scrape only image posts from 9gag

Question

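I want to scrape the first image post and blacklist its URL for the next search, so that the script skips URLs it has already used and looks for the next image post. I tried this to find the first image, but it doesn't work.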

driver = webdriver.Chrome()
driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)

Error:

Traceback (most recent call last):
  File "C:\Users\klaus\PycharmProjects\testTEST\main.py", line 37, in <module>
    gagposttitle = gagpost.find_element(By,value='img').get_attribute('alt')
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 763, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT,
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 740, in _execute
    return self.parent.execute(command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 428, in execute
    response = self.command_executor.execute(driver_command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 345, in execute
    data = utils.dump_json(params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\utils.py", line 23, in dump_json
    return json.dumps(json_struct)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type type is not JSON serializable

Process finished with exit code 1

I also tried this; sometimes it works and sometimes it doesn't.

driver = webdriver.Chrome()

driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)

Any help would be appreciated.

Answer 1

You can do it like this:

from selenium.common.exceptions import NoSuchElementException
...
# Get the feed element
feed = driver.find_element(By.CSS_SELECTOR, "div.main-wrap section#list-view-2")
# Get the streams from the feed
streams = feed.find_elements(By.CLASS_NAME, "list-stream")
# Debug number of streams
print(f"Streams: {len(streams)}")
# Iterate over each stream
for stream in streams:
    # Find articles within the stream; these are the 'posts'
    articles = stream.find_elements(By.TAG_NAME, "article")
    # Debug number of articles
    print(f"Articles: {len(articles)}")
    # Iterate over each article
    for article in articles:
        # Try/except here because some articles are adverts, these are skipped
        try:
            # Find the article title
            title = article.find_element(By.CSS_SELECTOR, "header > a")
        except NoSuchElementException:
            continue
        # Print the article title
        print(f"Title: {title.text}")

This prints:

Streams: 1
Articles: 3
Title: Hahahahaha Git Gud
Title: How to impress your guests

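This does not print every post on the page because the posts are lazy-loaded: they are fetched from the server as you scroll. To load them, you need to add scrolling to the code above. Fortunately, the Python Selenium documentation has an example for this particular case. You can also refer to a previous answer of mine to see what the implementation looks like.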
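The scrolling step might look roughly like the sketch below. It is not the example from the Selenium documentation: `scroll_to_load`, `pause`, and `max_rounds` are hypothetical names chosen for illustration, and the two-second pause is a guess that may need tuning. The only driver call it relies on is `driver.execute_script`.

```python
import time


def scroll_to_load(driver, pause=2.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls that caused new content to load.
    """
    loaded = 0
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the bottom so the page requests the next batch of posts.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the lazy loader time to fetch and render (assumed delay).
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new appeared, so stop scrolling.
            break
        last_height = new_height
        loaded += 1
    return loaded
```

Calling `scroll_to_load(driver)` before locating the feed should leave more articles in the DOM. If the feed turns out to drop older posts as new ones load, collecting posts after each scroll would be the safer variant.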

I have only added enough code to get the titles; you can extract the rest of the information you need from the article variable in the inner loop.
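As for the traceback in the question, it most likely comes from passing the `By` class itself to `find_element` instead of a locator strategy such as `By.TAG_NAME`; Selenium then tries to serialize the class to JSON and fails. For the image-only and blacklist parts, a minimal sketch could look like the following. The helper name `first_unused_image_post` and the `used_urls` set are hypothetical, and the `.image-post img` selector is taken from the question, not verified against 9gag's current markup. The literal `"css selector"` is the string value of `By.CSS_SELECTOR`; it is used here only to keep the sketch free of imports, and `By.CSS_SELECTOR` is the more readable choice in real code.

```python
def first_unused_image_post(articles, used_urls, selector=".image-post img"):
    """Return (url, title) for the first image post not yet in used_urls.

    The chosen URL is added to used_urls. Returns None if no new image
    post is found among the given article elements.
    """
    for article in articles:
        # find_elements returns an empty list instead of raising, so
        # adverts, videos, and other posts without a matching <img>
        # are skipped without a try/except.
        images = article.find_elements("css selector", selector)
        if not images:
            continue
        url = images[0].get_attribute("src")
        # Skip posts with no src and posts that were already used.
        if not url or url in used_urls:
            continue
        used_urls.add(url)
        return url, images[0].get_attribute("alt")
    return None
```

Keeping one `used_urls` set for the whole run and passing it to every call makes each call return the next unused image post. To keep the blacklist between runs, the set would have to be saved somewhere, for example in a text file with one URL per line.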
