Not able to scrape information from a website using lxml
I am trying to scrape user beer reviews from beeradvocate.com to analyze how users feel about different beer styles. But I only get results for the first few pages; the rest come back empty.
Situation:
- There are 500 different beer styles, and each beer has a different number of ratings and reviews.
- The site only shows one page of results to guests; to see everything, you need to log in.
My approach:
- Collect each beer's link and its number of ratings to define the loop range for that beer.
- Log in using a requests session and a POST request.
import time
import requests
import lxml.html

def review_scrape(beer_link, number_of_ratings):
    reviews = []
    rate = []
    for pages_i in range(0, int(number_of_ratings), 25):  # site shows 25 results/page
        session = requests.session()  # start the session
        payload = {'login': 'suzie102', 'password': ''}
        page1 = session.post("https://www.beeradvocate.com/community/login/login", data=payload)
        url = beer_link + '/?view=beer&sort=&start=%d' % (pages_i)
        page1 = session.get(url)
        time.sleep(3)
        soup1 = lxml.html.fromstring(page1.text)
        rate_i = soup1.xpath('//span[@class = "muted"]/text()')[8::3]
        print(url)
        reviews_i = soup1.xpath('//div/text()')
        reviews.append(reviews_i)
        print(len(reviews))
        rate.append(rate_i)
    return rate, reviews
Results:
I can only see one problem:
url = beer_link+'/?view=beer&sort=&start=%d'%(pages_i)
The / is redundant; what you need is
url = beer_link+'?view=beer&sort=&start=%d'%(pages_i)
That is why the links you print contain //?view.
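If you would rather not worry about stray slashes at all, urljoin from the standard library normalizes the join; a minimal sketch (the beer profile URL here is a placeholder):

    from urllib.parse import urljoin

    # urljoin resolves the query string against the base link, so a trailing
    # slash on the beer link no longer produces '//?view=...'.
    base = 'https://www.beeradvocate.com/beer/profile/123/456/'  # placeholder link
    url = urljoin(base, '?view=beer&sort=&start=%d' % 25)
    print(url)  # .../beer/profile/123/456/?view=beer&sort=&start=25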
I can also see that there is an anchor link, "next", pointing to the next page. I would recommend a while loop or recursion over that link instead of computing offsets; see the sketch below.
Other than that, I can't see anything missing from your script. Everything else looks in order and should work.
If you can provide more details, we may have more to go on.
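A minimal sketch of that idea, assuming an already logged-in requests session; the XPath for the pager link is an assumption and should be adjusted to the site's real markup:

    from urllib.parse import urljoin
    import lxml.html

    def scrape_all_pages(session, first_url):
        # Follow the 'next' anchor until it disappears, instead of
        # precomputing page offsets from the rating count.
        url = first_url
        ratings = []
        while url:
            page = session.get(url)
            tree = lxml.html.fromstring(page.text)
            ratings.extend(tree.xpath('//span[@class="muted"]/text()')[8::3])
            # Selector is an assumption; adjust it to the pager's markup.
            next_href = tree.xpath('//a[contains(text(), "next")]/@href')
            url = urljoin(url, next_href[0]) if next_href else None
        return ratings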
Update: thanks everyone for the comments. I tried scraping with Selenium instead, and it works now:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from tqdm import tqdm

def webstite_scrape_p2(beer_link, number_of_rev):
    driver = webdriver.Chrome('/home/sam/Downloads/chromedriver')
    url = 'https://www.beeradvocate.com/community/login/'
    driver.get(url)
    # Wait for each login form field to become clickable before typing
    loginelement = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//form[@class="xenForm formOverlay"]//dd//input[@name ="login"]')))
    loginelement.send_keys('suzie102')
    pwelement = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//form[@class="xenForm formOverlay"]//dl[@class ="ctrlUnit"]//dd//ul//li[@id = "ctrl_pageLogin_registered_Disabler"]//input[@name ="password"]')))
    pwelement.send_keys('')
    page_click = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//form[@class="xenForm formOverlay"]//dl[@class ="ctrlUnit submitUnit"]//dd//input[@type ="submit"]')))
    page_click.click()
    rate = []
    reviews = []
    avg_user = []
    for link, i in zip(beer_link, number_of_rev):
        for pages_i in tqdm(range(0, int(i), 25)):  # site shows 25 results/page
            new_url = link + '?view=beer&sort=&start=%d' % (pages_i)
            print(new_url)
            driver.get(new_url)
            #print(driver.find_element_by_name("hideRatings").is_selected())
            #check_box = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//form[@style="display:inline;margin:0;padding:0;"]//input[@type = "checkbox"]')))
            #check_box.click()
            time.sleep(5)
            driver.get(new_url)
            page_source = driver.page_source
            soup = BeautifulSoup(page_source, 'html.parser')
            rate_i = [i.get_text() for i in soup.find_all('span', class_="muted")][8::3]
            rate.append(rate_i)
            reviews_i = [i.get_text() for i in soup.find_all('div')]
            reviews.append(reviews_i)
            avg_i = [i.get_text() for i in soup.find_all('span', class_="BAscore_norm")]
            avg_user.append(avg_i)
    return rate, reviews, avg_user
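For reference, a hypothetical call; beer_links and rev_counts are placeholder names for the link list and per-beer rating counts gathered in an earlier scrape of the style listing:

    # Placeholder inputs: real values come from scraping the style page first.
    beer_links = ['https://www.beeradvocate.com/beer/profile/123/456/']  # hypothetical link
    rev_counts = [75]  # hypothetical rating count for that beer
    rate, reviews, avg_user = webstite_scrape_p2(beer_links, rev_counts)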