How do I scrape only image posts from 9gag

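I want to scrape the first image post and then blacklist its URL, so that the next search skips URLs that have already been used and finds the next image post. I tried this to find the first image, but it doesn't work.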

driver = webdriver.Chrome()
driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)

Error:

Traceback (most recent call last):
  File "C:\Users\klaus\PycharmProjects\testTEST\main.py", line 37, in <module>
    gagposttitle = gagpost.find_element(By,value='img').get_attribute('alt')
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 763, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT,
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 740, in _execute
    return self.parent.execute(command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 428, in execute
    response = self.command_executor.execute(driver_command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 345, in execute
    data = utils.dump_json(params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\utils.py", line 23, in dump_json
    return json.dumps(json_struct)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type type is not JSON serializable

Process finished with exit code 1

I also tried this; sometimes it works and sometimes it doesn't.

driver = webdriver.Chrome()

driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)

Any help would be appreciated.

You can do it like this:

from selenium.common.exceptions import NoSuchElementException
...
# Get the feed element
feed = driver.find_element(By.CSS_SELECTOR, "div.main-wrap section#list-view-2")
# Get the streams from the feed
streams = feed.find_elements(By.CLASS_NAME, "list-stream")
# Debug number of streams
print(f"Streams: {len(streams)}")
# Iterate over each stream
for stream in streams:
    # Find articles within the stream; these are the 'posts'
    articles = stream.find_elements(By.TAG_NAME, "article")
    # Debug number of articles
    print(f"Articles: {len(articles)}")
    # Iterate over each article
    for article in articles:
        # Try/except here because some articles are adverts, these are skipped
        try:
            # Find the article title
            title = article.find_element(By.CSS_SELECTOR, "header > a")
        except NoSuchElementException:
            continue
        # Print the article title
        print(f"Title: {title.text}")

This prints:

Streams: 1
Articles: 3
Title: Hahahahaha Git Gud
Title: How to impress your guests

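This does not print every post on the page, because the posts are lazily loaded: they are fetched from the server as you scroll. To load them, you need to add scrolling to the code above. Fortunately, the Python Selenium documentation has an example for this particular case. You can also refer to a previous answer of mine to see what an implementation looks like.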
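As a rough illustration of the scrolling step, here is a minimal sketch. The helper name `scroll_until_loaded` and its `pause` and `max_rounds` parameters are made up for this example and are not part of Selenium; the only Selenium call used is `driver.execute_script`. It assumes the page grows taller whenever new posts are loaded.

```python
import time


def scroll_until_loaded(driver, pause=2.0, max_rounds=10):
    """Scroll to the bottom until the page height stops growing.

    Hypothetical helper: returns how many scrolls loaded new content.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):
        # Jump to the bottom so the lazy loader requests more posts
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to fetch and render the new posts
        time.sleep(pause)
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            # Nothing new was added, so stop scrolling
            break
        last_height = new_height
        loaded += 1
    return loaded
```

Call `scroll_until_loaded(driver)` before collecting the articles. On an endless feed the height may never stop growing, which is why the sketch caps the number of rounds with `max_rounds`.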

I have only added enough code to get the titles; you can extract the rest of the information you need from the article variable in the inner loop.
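For the image and blacklist part of the question, something along these lines might work. This is only a sketch: `first_new_image_post` and `seen_urls` are hypothetical names, and the `.image-post img` selector is taken from the question, so it may not match the site's current markup. The string `"css selector"` is the value of Selenium's `By.CSS_SELECTOR`.

```python
CSS = "css selector"  # same value as selenium's By.CSS_SELECTOR


def first_new_image_post(articles, seen_urls, image_selector=".image-post img"):
    """Return (url, title) of the first image post not yet in seen_urls.

    Hypothetical helper: adds the returned URL to seen_urls so the next
    call skips it, and returns None when no unseen image post is found.
    """
    for article in articles:
        # find_elements returns an empty list (no exception) when the
        # article has no image, e.g. adverts or video posts
        images = article.find_elements(CSS, image_selector)
        if not images:
            continue
        url = images[0].get_attribute("src")
        if not url or url in seen_urls:
            # Skip images without a src and URLs that were already used
            continue
        seen_urls.add(url)
        return url, images[0].get_attribute("alt")
    return None
```

Create `seen_urls = set()` once, then pass it together with the `articles` list from the loop above on every search; each call returns the next image post that has not been used yet.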