Web scraping customer reviews - Invalid selector error using XPath
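I am trying to use Selenium to extract the user ID, rating, and comment for each review from the site below, but it raises an "invalid selector" error. I think the XPath I defined to get the comment text is the cause, but I have not been able to fix it. The link to the site is below:
The code I am using is as follows: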
# Class for scraping reviews from consumeraffairs.com
import pandas as pd
from selenium import webdriver


class CarForumCrawler():
    def __init__(self, start_link):
        self.link_to_explore = start_link
        self.comments = pd.DataFrame(columns=['rating', 'user_id', 'comments'])
        self.driver = webdriver.Chrome(executable_path=r'C:/Users/mumid/Downloads/chromedriver/chromedriver.exe')
        self.driver.get(self.link_to_explore)
        self.driver.implicitly_wait(5)
        self.extract_data()
        self.save_data_to_file()

    def extract_data(self):
        # Collect the id of every review container on the page
        ids = self.driver.find_elements_by_xpath("//*[contains(@id,'review-')]")
        comment_ids = []
        for i in ids:
            comment_ids.append(i.get_attribute('id'))
        for x in comment_ids:
            # Extract the rating for each user on a page
            user_rating = self.driver.find_elements_by_xpath('//*[@id="' + x +'"]/div[1]/div/img')[0]
            rating = user_rating.get_attribute('data-rating')
            # Extract the user id for each user on a page
            userid_element = self.driver.find_elements_by_xpath('//*[@id="' + x +'"]/div[2]/div[2]/strong')[0]
            userid = userid_element.get_attribute('itemprop')
            # Extract the message for each user on a page
            user_message = self.driver.find_elements_by_xpath('//*[@id="' + x +'"]]/div[3]/p[2]/text()')[0]
            comment = user_message.text
            # Add the rating, user id and comment for each user to the dataframe
            self.comments.loc[len(self.comments)] = [rating, userid, comment]

    def save_data_to_file(self):
        # Save the dataframe content to a CSV file
        self.comments.to_csv('Tesla_rating-6.csv', index=None, header=True)

    def close_spider(self):
        # End the session
        self.driver.quit()


try:
    url = 'https://www.consumeraffairs.com/automotive/tesla_motors.html'
    mycrawler = CarForumCrawler(url)
    mycrawler.close_spider()
except:
    raise
The error I get is as follows:
Also, the XPath I am trying to follow comes from the following HTML:
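What you are seeing is a classic error.
The expression in find_elements_by_xpath('//*[@id="' + x +'"]]/div[3]/p[2]/text()')[0]
ends in text(), which selects text nodes, whereas you need to pass an XPath expression that selects elements. It also contains a stray ] after the id predicate, which makes the expression syntactically invalid.
You need to change it to:
    user_message = self.driver.find_elements_by_xpath('//*[@id="' + x +'"]/div[3]/p[2]')[0]
The following line, comment = user_message.text, then reads the text of that paragraph element.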
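As a rough sketch of the same idea outside the crawler class: the locator should stop at the element, and the text is read from the returned element afterwards. The helper names `review_xpath` and `extract_comment` are hypothetical, and the code assumes the Selenium 4 `find_elements(by, value)` call, where the string "xpath" is the value of `By.XPATH`. It also guards against an empty result instead of indexing `[0]` directly.

```python
def review_xpath(review_id: str, tail: str) -> str:
    """Build an XPath rooted at the element whose id equals review_id.

    Hypothetical helper: it keeps the id predicate in one place so a
    stray bracket cannot creep into each hand-written expression.
    """
    return f'//*[@id="{review_id}"]{tail}'


def extract_comment(driver, review_id: str) -> str:
    """Return the review paragraph text, or "" when nothing matches.

    The XPath ends at the <p> element (no trailing /text()), because
    Selenium locators must resolve to element nodes.
    """
    # "xpath" is the string value of selenium.webdriver.common.by.By.XPATH
    elements = driver.find_elements("xpath", review_xpath(review_id, "/div[3]/p[2]"))
    return elements[0].text if elements else ""
```

Inside `extract_data` this would be called as `comment = extract_comment(self.driver, x)`. Whether `/div[3]/p[2]` is the right path depends on the page's HTML, which is not shown here.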
Reference
You can find a relevant detailed discussion in:
- invalid selector: The result of the xpath expression "//a[contains(@href, 'mailto')]/@href" is: [object Attr] getting the href attribute with Selenium