How do I scrape only image posts from 9gag
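I want to scrape the first image post and blacklist its URL for the next search, so that the scraper skips URLs that have already been used and looks for the next image post.
I tried this to find the first image, but it doesn't work.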
driver = webdriver.Chrome()
driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)
Error:
Traceback (most recent call last):
  File "C:\Users\klaus\PycharmProjects\testTEST\main.py", line 37, in <module>
    gagposttitle = gagpost.find_element(By,value='img').get_attribute('alt')
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 763, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT,
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 740, in _execute
    return self.parent.execute(command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 428, in execute
    response = self.command_executor.execute(driver_command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 345, in execute
    data = utils.dump_json(params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\utils.py", line 23, in dump_json
    return json.dumps(json_struct)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type type is not JSON serializable
Process finished with exit code 1
I also tried this; sometimes it works and sometimes it doesn't.
driver = webdriver.Chrome()
driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)
Any help would be much appreciated.
You can implement it like this:
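A note on the traceback above: the failing line calls find_element(By, value='img'), which passes the By class itself as the locator strategy rather than one of its members such as By.TAG_NAME. The locator is serialized to JSON before being sent to the browser, and a class object cannot be serialized. The sketch below reproduces the same TypeError without a browser. The By class here is a hypothetical stand-in and the payload is a simplified assumption, not Selenium's actual internals.

```python
import json


class By:
    """Hypothetical stand-in for selenium's By class, for illustration only."""
    TAG_NAME = "tag name"


def serialize_locator(by, value):
    """Serialize a locator the way a JSON-based protocol would (simplified)."""
    return json.dumps({"using": by, "value": value})


# Passing the class itself fails, because a class is not JSON serializable.
try:
    serialize_locator(By, "img")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing a member of the class works, because it is a plain string.
payload = serialize_locator(By.TAG_NAME, "img")

print(error_message)
print(payload)
```

In the real script the equivalent fix would be to name a strategy explicitly, for example find_element(By.TAG_NAME, 'img').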
from selenium.common.exceptions import NoSuchElementException
...
# Get the feed element
feed = driver.find_element(By.CSS_SELECTOR, "div.main-wrap section#list-view-2")
# Get the streams from the feed
streams = feed.find_elements(By.CLASS_NAME, "list-stream")
# Debug number of streams
print(f"Streams: {len(streams)}")
# Iterate over each stream
for stream in streams:
    # Find articles within the stream; these are the 'posts'
    articles = stream.find_elements(By.TAG_NAME, "article")
    # Debug number of articles
    print(f"Articles: {len(articles)}")
    # Iterate over each article
    for article in articles:
        # Try/except here because some articles are adverts, these are skipped
        try:
            # Find the article title
            title = article.find_element(By.CSS_SELECTOR, "header > a")
        except NoSuchElementException:
            continue
        # Print the article title
        print(f"Title: {title.text}")
This prints:
Streams: 1
Articles: 3
Title: Hahahahaha Git Gud
Title: How to impress your guests
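This does not print every post on the page, because the posts are lazy-loaded: they are fetched from the server as you scroll.
To load them, you need to add scrolling to the code above. Fortunately, the Python Selenium documentation has an example for this particular case. You can also refer to a previous answer of mine to see what an implementation looks like.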
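As a rough illustration of the scrolling step, here is a minimal sketch that scrolls to the bottom of the page until the page height stops growing. The function name scroll_until_stable and the parameters pause and max_rounds are hypothetical, chosen for this example. It assumes only that driver offers execute_script, as a Selenium WebDriver does.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=10):
    """Scroll to the bottom repeatedly; return how many scrolls were done.

    Stops when the page height no longer grows or after max_rounds scrolls,
    whichever comes first. The cap matters on feeds that never end.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page to trigger lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to fetch and render the next batch of posts.
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds
```

Calling scroll_until_stable(driver, max_rounds=3) before collecting the articles should load a few more batches. The fixed pause is a simplification, and an explicit wait on the number of article elements may be more reliable.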
I only added enough code to get the titles; you can extract the rest of the information you need from the article variable in the inner loop.
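To connect this back to the original goal (take the first image post and skip URLs that were already used), one possible sketch follows. The helper names iter_image_posts and first_unseen are hypothetical. The selector ".image-post img" is taken from the question and is only assumed to match image posts inside each article. The locator strategy is passed as the plain string "css selector", which is the value of By.CSS_SELECTOR, so the sketch does not need a Selenium import.

```python
CSS_SELECTOR = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def iter_image_posts(articles, selector=".image-post img"):
    """Yield (src, alt) for every article that contains a matching image."""
    for article in articles:
        # find_elements returns an empty list when nothing matches,
        # so adverts and non-image posts are skipped without try/except.
        images = article.find_elements(CSS_SELECTOR, selector)
        if not images:
            continue
        img = images[0]
        yield img.get_attribute("src"), img.get_attribute("alt")


def first_unseen(posts, seen_urls):
    """Return the first (url, title) whose url is not in seen_urls.

    The returned url is added to seen_urls, so the next call skips it.
    Returns None when every post has already been used.
    """
    for url, title in posts:
        if url and url not in seen_urls:
            seen_urls.add(url)
            return url, title
    return None
```

With the articles list from the loop above and a set such as seen = set(), each call to first_unseen(iter_image_posts(articles), seen) should return the next unused image post. To keep the blacklist between runs, the set would have to be saved to a file, which this sketch leaves out.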