How to use Selenium to get the replies to AJAX comments?

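I am learning to scrape and trying to collect arguments from this website. I am using BeautifulSoup and Selenium to do this.

So far I can collect all the arguments, but not the replies to the comments. To see the replies, you have to click the red arrow (View Replies). Note that not all comments have replies.

As far as I can tell, there are two possible solutions: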

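1. As highlighted in green, I noticed that each argument has a unique ID (aid). I need Selenium to click the red arrow so that the replies are listed. But how can I navigate to View Replies? All I know is that aid and View Replies are on elements with the same tag name.

2. Use Selenium to click every View Replies link in all the comments, then use BeautifulSoup to get the values from the tags. I think the second option is easier. The following code is what I wrote for the second option: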

while True:
    try:
        wait3 = WebDriverWait(driver, 5)
        btn_view_reply = wait3.until(EC.element_to_be_clickable((By.CLASS_NAME, "msg-contain")))
        btn_view_reply.click()

        wait4 = WebDriverWait(driver, 3)
        loadReply = wait4.until(EC.presence_of_element_located((By.CLASS_NAME, "msg-contain")))

        content = driver.execute_script("return document.documentElement.outerHTML;")

    except TimeoutException:
        break

The problem is that Selenium does not move on to the next "View Replies" button. Could you give me some suggestions? Thanks.

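Here is a different approach:

  • explicitly wait for the posts to appear
  • for each post (the aid attribute uniquely identifies a post):

    • check how many replies it has
    • if there are replies:

      • click the link
      • explicitly wait for the replies to appear
      • parse the replies

Implementation: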

from pprint import pprint
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get('http://www.debate.org/opinions/is-global-climate-change-man-made')
wait = WebDriverWait(driver, 5)

# wait for posts to load
posts_xpath = '//div[@id="debate"]/div/ul/li[@aid]'
wait.until(EC.presence_of_element_located((By.XPATH, posts_xpath)))

# collect posts data
posts = []
# find_element(By...) is used throughout: the find_element_by_* helpers
# were removed in Selenium 4.3
for post in driver.find_elements(By.XPATH, posts_xpath):
    aid = post.get_attribute('aid')
    contents = post.find_element(By.TAG_NAME, 'p').text

    replies = []

    # check how many replies there are
    reply_count = int(post.find_element(By.CLASS_NAME, 'm-cnt').text)
    if reply_count > 0:
        # search inside this post only, so the click hits its own link
        post.find_element(By.CLASS_NAME, 'msg-contain').click()

        replies_xpath = '//li[@aid="{aid}"]//div[@class="comment-container"]//div[@class="comment"]/div[@class="comment-body"]'.format(aid=aid)
        wait.until(EC.presence_of_element_located((By.XPATH, replies_xpath)))

        for reply in driver.find_elements(By.XPATH, replies_xpath):
            replies.append(reply.text)

    posts.append({'contents': contents, 'replies': replies})

pprint(posts)

This produces the following output:

[{'contents': u'If we have never had been here, then the earth would have gone on healthy and the way it should, but since we are here there are disruptive thing on the earth that are causing the destruction on the earth. And most of these factors are man made, if not all are man made',
  'replies': [u"Only 8% of the world's CO2 comes from humans though..."]},
 {'contents': u"Yes, I know the temperature changes, but that's natural, it happens another way... now there is a lot of CO2 in the air, the long-wave radiation increases and the heat gets trapped. Greenland is partially melting. Why is it melting now? Well, I guess it kind of melted before, but still, why is it melting more than other times? I have to go with man-made, I need more proof that it is a natural cycle, this time.",
  'replies': [u'The trash we burn and we burry is causing that to happen so its just going to melt']},
 {'contents': u'The most prominent reason is that most of the energy we depend on is coming from the fossil fuels and its burning produced carbon dioxide, the main cause for global climate change. Another reason is that forests are disappearing because of many purposes for human life. Without strong change in our energy source and use, global climate change will get worse.',
  'replies': []},
  ...
 {'contents': u'Solar flares happen on a systematic basis. There are varying degrees of these flares. When these low, medium, and high impact flares happen at the same time an incredibly large amount of bad chemicals are released and cause more impact than human activity. An example is if the low impact flares happened every two years, the medium impact may happen every four years and the high impact may happen every eight years so they would collide quite often.',
  'replies': []}]

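You would still need to extend the solution to handle the "Load more Arguments" button at the bottom in order to extract more arguments, but this should give you a good starting point.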
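One possible way to handle that button is sketched below. It is a rough sketch under assumptions, not a tested solution for this site: the page markup for the button is not shown here, so LOAD_MORE_XPATH is a hypothetical locator that would have to be adapted after inspecting the page. The helper name load_all_arguments and its parameters are illustrative as well. The idea is to click the button, poll until the number of posts grows, and stop when the button is gone or nothing new arrives.

```python
import time

# Same posts locator as in the implementation above.
POSTS_XPATH = '//div[@id="debate"]/div/ul/li[@aid]'
# ASSUMPTION: hypothetical locator for the "Load more Arguments" button;
# the real markup is not known here and must be checked in the browser.
LOAD_MORE_XPATH = '//a[contains(., "Load more Arguments")]'


def load_all_arguments(driver, max_clicks=50, timeout=5.0, poll=0.25):
    """Click the load-more button until no new posts appear.

    Returns the number of clicks performed. The strategy string "xpath"
    is the value behind selenium's By.XPATH constant.
    """
    clicks = 0
    while clicks < max_clicks:
        buttons = [b for b in driver.find_elements("xpath", LOAD_MORE_XPATH)
                   if b.is_displayed()]
        if not buttons:
            break  # button is gone: nothing more to load

        before = len(driver.find_elements("xpath", POSTS_XPATH))
        buttons[0].click()
        clicks += 1

        # Poll until the AJAX response adds posts, or give up after `timeout`.
        deadline = time.monotonic() + timeout
        while len(driver.find_elements("xpath", POSTS_XPATH)) <= before:
            if time.monotonic() > deadline:
                return clicks  # no new posts arrived; assume we hit the end
            time.sleep(poll)
    return clicks
```

It would be called as load_all_arguments(driver) right after the initial wait for posts and before the collection loop, so that the loop sees every loaded post. The max_clicks cap is only there to keep a misbehaving page from looping forever.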