PhantomJS browser not loading JavaScript for certain URLs
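I am trying to download Google Trends data, using PhantomJS to load the page and extract the data I need. When I run my code with only one keyword in the URL (example URL: https://www.google.com/trends/explore?date=today%203-m&geo=US&q=Blue), it works fine. As soon as I add a second keyword (example URL: https://www.google.com/trends/explore?date=today%203-m&geo=US&q=Blue,Red), PhantomJS no longer loads the page correctly and I cannot find the data I need.
I have tried increasing the time the browser waits and tried many different keywords, without success. I am out of ideas and do not understand why my program stops working after such a small change to the URL (the tags and page structure are almost identical for both URLs, so the problem is not that the tags are no longer named the same as before).
Here is the code in question: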
# Reading Google Trends data
import time

from bs4 import BeautifulSoup
from selenium import webdriver

google_trend_array = []
url = 'https://www.google.com/trends/explore?date=today%203-m&geo=US&q=Blue,Red'
# Raw string so that "\b" in the Windows path is not read as a backspace escape
browser = webdriver.PhantomJS(r'...\phantomjs-2.1.1-windows\bin\phantomjs.exe')
ran_smooth = False
time_to_sleep = 3
# ran_smooth tracks whether the page loaded and the data was extracted; if not, the page is loaded again
while not ran_smooth:
    browser.get(url)
    time.sleep(time_to_sleep)
    soup = BeautifulSoup(browser.page_source, "html.parser")  # BS object to use bs4
    table = soup.find('div', {'aria-label': 'A tabular representation of the data in the chart.'})
    # If the page didn't load, table is None and this try block raises AttributeError
    try:
        # Copy all the data out of the Google Trends table
        for col in table.find_all('td'):
            # The table holds both dates and trend values; isdigit() keeps only the trend values
            if col.string.isdigit():
                trend_number = int(col.string)
                google_trend_array.append(trend_number)
        # Program ran through, leave the while loop
        ran_smooth = True
    except AttributeError:
        print('page not loading for ' + url + ', trying again...')
        google_trend_array = []  # discard partial results before retrying
        time_to_sleep += 1  # increase time to sleep so that page can load
print(google_trend_array)
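The loop above retries forever if the table never appears, which is what happens with the two-keyword URL. The following is a minimal sketch of the same retry-with-longer-wait idea with a cap on attempts, so a page that never loads surfaces as an error instead of an endless loop. It does not explain why the two-keyword page fails. The names `TrendCellParser`, `extract_trend_values`, `fetch_with_retries`, and `load_page` are hypothetical, and the sketch swaps BeautifulSoup for the standard-library `html.parser` and collects every numeric `<td>` on the page rather than only those inside the `aria-label` div.

```python
from html.parser import HTMLParser


class TrendCellParser(HTMLParser):
    """Collects the integer contents of <td> cells; dates and other text are skipped."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_cell and text.isdigit():
            self.values.append(int(text))


def extract_trend_values(html):
    """Return all purely numeric <td> values found in the given HTML string."""
    parser = TrendCellParser()
    parser.feed(html)
    parser.close()
    return parser.values


def fetch_with_retries(load_page, max_attempts=5, first_wait=3):
    """Call load_page(wait_seconds) until it yields trend values or attempts run out.

    load_page is a hypothetical callable that loads the page, waits the given
    number of seconds, and returns the page source as a string.
    """
    wait = first_wait
    for _ in range(max_attempts):
        values = extract_trend_values(load_page(wait))
        if values:
            return values
        wait += 1  # give the page one more second on the next attempt
    raise RuntimeError("no trend values found after %d attempts" % max_attempts)
```

With PhantomJS, `load_page` could be a small function that calls `browser.get(url)`, sleeps for the given number of seconds, and returns `browser.page_source`.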
You should take a look at pytrends instead of reinventing the wheel.
Here is a small example of how to pull a DataFrame out of Google Trends:
import pytrends.request
google_username = "<your_login>@gmail.com"
google_password = "<your_password>"
# connect to Google
pytrend = pytrends.request.TrendReq(google_username, google_password, custom_useragent='My Pytrends Script')
trend_payload = {'q': 'Pizza, Italian, Spaghetti, Breadsticks, Sausage', 'cat': '0-71'}
# trend = pytrend.trend(trend_payload)
df = pytrend.trend(trend_payload, return_type='dataframe')
You will get:
            breadsticks  italian  pizza  sausage  spaghetti
Date
2004-01-01          0.0      9.0   34.0      3.0        3.0
2004-02-01          0.0     10.0   32.0      2.0        3.0
2004-03-01          0.0     10.0   32.0      2.0        3.0
2004-04-01          0.0      9.0   31.0      2.0        2.0
2004-05-01          0.0      9.0   32.0      2.0        2.0
2004-06-01          0.0      8.0   29.0      2.0        3.0
2004-07-01          0.0      8.0   34.0      2.0        3.0
[...]
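The snippet above logs in with a Google account and calls `trend()`, which matches older pytrends releases. Later releases (4.x onward, to my knowledge) dropped the login and moved to `build_payload()` followed by `interest_over_time()`, so it is worth checking which version is installed. Below is a minimal sketch under that assumption, using the two keywords and the time range from the question. `fetch_trends` and its `client` parameter are hypothetical names introduced here; the `client` argument only exists so the function can be tried with a stub instead of a live connection.

```python
def fetch_trends(keywords, timeframe="today 3-m", geo="US", client=None):
    """Return interest-over-time data for the given keywords.

    client is any object exposing build_payload() and interest_over_time().
    When it is omitted, a pytrends TrendReq is created, which assumes a
    pytrends 4.x-style install and network access.
    """
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        from pytrends.request import TrendReq
        client = TrendReq(hl="en-US", tz=360)
    client.build_payload(kw_list=list(keywords), timeframe=timeframe, geo=geo)
    return client.interest_over_time()


# Example (requires pytrends and network access):
# df = fetch_trends(["Blue", "Red"])
# print(df.head())
```

The result should be a pandas DataFrame indexed by date with one column per keyword, similar in shape to the output shown above.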