Beautiful Soup won't return results
Here is my code. It doesn't raise any errors, but it doesn't return any results either.
import requests
from bs4 import BeautifulSoup
googtrends = requests.get("https://www.google.com/trends/")
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(googtrends.content, "html.parser")
links = soup.find_all("a", {"class": "trending-story ng-isolate-scope"})
print(links)
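I haven't solved this yet; I moved on to other things. I'm going to try Selenium first, then Selenium with PhantomJS or Zombie.js, and if that still doesn't work I'll use pytrends. I just checked pytrends, though, and it needs a Gmail account. I have one, but I'd rather get this working without the API first.
I'll post back here once it works.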
Yes, this page is rendered dynamically by JS. Let's also try changing the request headers (that fails too, which confirms that JS is the cause).
Test code:
import requests
from bs4 import BeautifulSoup
my_headers={"Host": "www.google.com",
"User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,am;q=0.7,zh-HK;q=0.3",
"Accept-Encoding": "gzip, deflate",
"Cookie": "PREF=ID=1111111111111111:FF=0:LD=en:TM=1439993585:LM=1444815129:V=1:S=Zjbb3gK_m_n69Hqv; NID=72=F6UyD0Fr18smDLJe1NzTReJn_5pwZz-PtXM4orYW43oRk2D3vjb0Sy6Bs_Do4J_EjeOulugs_x2P1BZneufegpNxzv7rkY9BPHcfdx9vGOHtJqv2r46UuFI2f5nIZ1Cu4RcT9yS5fZ1SUhel5fHTLbyZWhX-yiPXvZCiQoW4FjZd-3Bwxq8yrpdgmPmf4ufvFNlmTd3y; OGP=-5061451:; OGPC=5061713-3:",
"Connection": "keep-alive"}
googtrends = requests.get("https://www.google.com/trends/", headers=my_headers)
# Keep the body as text so the substring check below compares str with str
my_content = googtrends.text
soup = BeautifulSoup(my_content, "html.parser")
links = soup.find_all("a", {"class": "trending-story ng-isolate-scope"}, href=True)
# Check whether the response contains the content we see in the browser.
# The rendered page shows "Apple Inc., App Store", so look for it in the response.
print("Apple Inc., App Store" in my_content)
# It prints False, so the page is rendered by JS; changing the headers makes no difference.
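To see why the search comes back empty without touching the network, here is a small offline sketch. Both HTML strings are made-up stand-ins (they are not Google's actual markup): one imitates what a server might send before any script runs, the other imitates the DOM after the browser has executed the script. The helper name `count_story_links` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the server response: an empty container that
# client-side JavaScript is expected to fill in later.
RAW_HTML = """
<html><body>
  <div id="stories"></div>
  <script src="app.js"></script>
</body></html>
"""

# Hypothetical stand-in for the DOM after the browser has run the script.
RENDERED_HTML = """
<html><body>
  <div id="stories">
    <a class="trending-story ng-isolate-scope" href="/story/1">Story 1</a>
  </div>
</body></html>
"""


def count_story_links(html):
    """Count <a> tags matched by the same search the question uses."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("a", {"class": "trending-story ng-isolate-scope"}))


print(count_story_links(RAW_HTML))       # 0
print(count_story_links(RENDERED_HTML))  # 1
```

requests only ever receives something like the first string, so Beautiful Soup has nothing to match, no matter which headers are sent.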
So try a webdriver, such as Selenium driving Firefox, Chrome, or PhantomJS, which actually executes the JS. Better still, try the API.
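A minimal sketch of the webdriver route, assuming Selenium 4 with Firefox and geckodriver installed. The function names `fetch_rendered_html` and `extract_trending_links` are hypothetical, and the `a.trending-story` selector is taken from the question, so it may not match the live page any more.

```python
from bs4 import BeautifulSoup


def extract_trending_links(html):
    """Return the href of every <a> tag that carries the 'trending-story' class."""
    soup = BeautifulSoup(html, "html.parser")
    # A CSS selector matches on a single class, whatever other classes
    # (such as ng-isolate-scope) are also present on the element.
    return [a["href"] for a in soup.select("a.trending-story[href]")]


def fetch_rendered_html(url, timeout=15):
    """Load url in a real browser so its JavaScript runs (assumes Selenium 4)."""
    # Imported lazily so extract_trending_links works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait until at least one matching element has been added to the DOM;
        # this raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "a.trending-story"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Usage would look like `extract_trending_links(fetch_rendered_html("https://www.google.com/trends/"))`. Keeping the parsing in its own function means it can be checked against a saved HTML string without launching a browser.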