Facebook Group Post Scraping Using Selenium Only Returns One Post
I'm building a Facebook group scraper. I've managed to write the login + name-scraping code, but for some reason my code only returns one result instead of all of the posts on the page, as I'd like.
Here is my code:
for result in driver.find_elements_by_xpath('//div[@class="rq0escxv l9j0dhe7 du4w35lb fhuww2h9 hpfvmrgz gile2uim pwa15fzy g5gj957u aov4n071 oi9244e8 bi6gxh9e h676nmdw aghb5jc5"]'):
    poster = result.find_element_by_xpath('//a[@class="oajrlxb2 g5ia77u1 qu0x051f esr5mh6w e9989ue4 r7d6kgcz rq0escxv nhd2j8a9 nc684nl6 p7hjln8o kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x jb3vyjys rz4wbd8a qt6c0cv9 a8nywdso i1ao9s8h esuyzwwr f1sip0of lzcic4wl oo9gr5id gpro0wi8 lrazzd5p"]/strong/span').text
    description = result.find_element_by_xpath('//div[@class="kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x c1et5uql ii04i59q"]').text
    groupcomments.append({
        'poster': poster,
        'description': description,
    })
print(groupcomments)
Here is a snippet of the Facebook source code (you can find it for yourself here: https://www.facebook.com/groups/286175922122417)
<div data-pagelet="GroupFeed"><div class="j83agx80 l9j0dhe7 k4urcfbm"><div class="rq0escxv l9j0dhe7 du4w35lb hybvsw6c io0zqebd m5lcvass fbipl8qg nwvqtn77 k4urcfbm ni8dbmo4 stjgntxs sbcfpzgs" style="border-radius: max(0px, min(8px, ((100vw - 4px) - 100%) * 9999)) / 8px;"><div class="ihqw7lf3"><div class="rq0escxv l9j0dhe7 du4w35lb j83agx80 cbu4d94t pfnyh3mw d2edcug0 e5nlhep0 aodizinl"><div class="rq0escxv l9j0dhe7 du4w35lb j83agx80 cbu4d94t buofh1pr tgvbjcpo"><div class="rq0escxv l9j0dhe7 du4w35lb j83agx80 cbu4d94t pfnyh3mw d2edcug0 hv4rvrfc dati1w0a"><div class="j83agx80 cbu4d94t ew0dbk1b irj2b8pg"><div class="qzhwtbm6 knvmm38d"><span class="d2edcug0 hpfvmrgz qv66sw1b c1et5uql oi732d6d ik7dh3pa ht8s03o8 a8c37x1j keod5gw0 nxhoafnm aigsh9s9 d9wwppkn fe6kdd0r mau55g9w c8b282yb iv3no6db a5q79mjw g1cxx5fr lrazzd5p oo9gr5id" dir="auto"><div class="rq0escxv l9j0dhe7 du4w35lb j83agx80 pfnyh3mw i1fnvgqd bp9cbjyn owycx6da btwxx1t3 jeutjz8y"><div class="rq0escxv l9j0dhe7 du4w35lb j83agx80 cbu4d94t g5gj957u d2edcug0 hpfvmrgz rj1gh0hx buofh1pr"
Any ideas on how I can grab all of the information I'm looking for? Thanks in advance :)
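For reference, the one-result behaviour is the classic symptom of an absolute XPath inside the loop: an expression starting with `//` always searches from the document root, so `result.find_element_by_xpath('//a[...]')` returns the first matching `<a>` on the whole page on every iteration. Prefixing the inner expressions with a dot (`.//`) scopes the search to the current post element. A minimal sketch of the corrected loop, assuming the same pre-Selenium-4 API as the question (the class strings are shortened to `...` here; the full ones from the code above go in their place):

```python
# Hypothetical XPath constants; substitute the full class strings from the question.
POST_XPATH = '//div[@class="..."]'                # absolute: collects every post container
POSTER_XPATH = './/a[@class="..."]/strong/span'   # leading dot: relative to one post
DESCRIPTION_XPATH = './/div[@class="..."]'        # leading dot: relative to one post

def scrape_posts(driver):
    """Collect poster/description pairs from a logged-in driver on the group page."""
    posts = []
    for result in driver.find_elements_by_xpath(POST_XPATH):
        # Because these XPaths start with '.', the search is limited to `result`,
        # so each iteration picks up that post's own name and text.
        poster = result.find_element_by_xpath(POSTER_XPATH).text
        description = result.find_element_by_xpath(DESCRIPTION_XPATH).text
        posts.append({'poster': poster, 'description': description})
    return posts
```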
I managed to scrape what I wanted by using BeautifulSoup as the HTML parser instead, simply grabbing the information by the class names of the elements I was after (not a 100% foolproof solution, since those can change, but they're easy to swap out in the code, so I figured it's better than nothing...)
from bs4 import BeautifulSoup

while True:
    soup = BeautifulSoup(driver.page_source, "html.parser")
    all_posts = soup.find_all("div", {"class": "du4w35lb k4urcfbm l9j0dhe7 sjgh65i0"})
    for post in all_posts:
        try:
            name = post.find("a", {"class": "oajrlxb2 g5ia77u1 qu0x051f esr5mh6w e9989ue4 r7d6kgcz rq0escxv nhd2j8a9 nc684nl6 p7hjln8o kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x jb3vyjys rz4wbd8a qt6c0cv9 a8nywdso i1ao9s8h esuyzwwr f1sip0of lzcic4wl oo9gr5id gpro0wi8 lrazzd5p"}).get_text()
        except AttributeError:  # post.find() returned None
            name = "not found"
        print(name)
If you want a more in-depth tutorial, I also made a video showing how I wrote it, you can watch it here (you can also find the full code in the video description)
import time

# do enough scrolling
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)
# repeat the scraping
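Putting the answer's pieces together (parse, scroll, wait, repeat) also needs a stop condition, since `while True` never exits on its own. A sketch of one possible loop, with a hypothetical `max_idle` knob (stop after that many scrolls that surface nothing new); `get_names` and `scroll` stand in for the BeautifulSoup parse and the `execute_script` scroll shown above:

```python
import time

def scrape_until_idle(get_names, scroll, max_idle=3, delay=2.0):
    """Repeat parse -> scroll -> wait until `max_idle` rounds add nothing new.

    get_names -- callable returning the poster names currently in the DOM
                 (e.g. the BeautifulSoup parse from the answer)
    scroll    -- callable that scrolls the page (the execute_script call)
    """
    seen, idle = set(), 0
    while idle < max_idle:
        new = [n for n in get_names() if n not in seen]
        seen.update(new)
        idle = 0 if new else idle + 1   # reset the counter whenever something new appears
        scroll()
        time.sleep(delay)               # give the feed time to load more posts
    return seen
```

With a real driver this would be called as `scrape_until_idle(lambda: parse_names(driver.page_source), lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"))`, where `parse_names` is a hypothetical wrapper around the BeautifulSoup code above.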