Python lxml not picking up tags
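Hello, I am trying to scrape CNN's primary results for this election season and do some machine learning with them. I am using Python 3.5, and after some research I saw that I could do this with lxml, BeautifulSoup, and requests. After failing with BeautifulSoup (I tried XPath, but it did not pick anything up), I tried lxml. On the main page for Iowa (and for every state so far), CNN breaks the results down by county, with each candidate's percentage of the vote. Looking at the HTML of the page, I saw that each county name is stored in an h2 tag that sits directly inside a div tag (which carries a class attribute), and the same pattern repeats for every county. So I used CSSSelector to try to capture them (since the h2 always comes right after the county div). The relevant HTML looks like this: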
<div class="race-results__county-header race-results__county-name section-header__column" data-reactid=".0.4.3.0.0.0.0.$0.0.$0">
<h2 class="section-heading" data-reactid=".0.4.3.0.0.0.0.$0.0.$0.0">Adair</h2>
</div>
The code looks like this:
from lxml import html
import requests
page = requests.get('http://www.cnn.com/election/primaries/counties/ia/Rep').text
doc = html.fromstring(page)
link = doc.cssselect("div h2")
print(link)
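However, when I print link, I get nothing (just an empty list, []). Is this a problem with the HTML layout, my code, or the parser? I am using JetBrains PyCharm, but I don't think that has anything to do with it. I am very new to this, so any alternative approach would be greatly appreciated.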
The problem is that the page does not contain the results you expect, because they are probably rendered by JavaScript.
When I download the content from the given URL, there are no <h2> elements, but I do find a message: Please enable JavaScript to view CNN's 2016 Election Center.
You are not getting the data because it is not in the page.
Don't be confused by the fact that your browser shows you <h2> elements; that is because JavaScript has already inserted them.
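A quick way to confirm this is to count the <h2> elements in the HTML exactly as the server sent it, before any JavaScript runs. The sketch below uses only the standard library's html.parser so that it runs without a network connection or third-party packages; the names TagCounter and count_tags and the two sample strings are made up for illustration and are not CNN's actual markup.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html_text, tag):
    """Return the number of <tag> start tags found in html_text."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts.get(tag, 0)


# Hypothetical stand-in for what the server sends: a placeholder, no results.
raw_html = (
    "<html><body><noscript>Please enable JavaScript</noscript>"
    "<div id='app'></div></body></html>"
)

# Hypothetical stand-in for the DOM after the browser has run the scripts.
rendered_html = (
    "<html><body><div class='race-results__county-name'>"
    "<h2 class='section-heading'>Adair</h2></div></body></html>"
)

print(count_tags(raw_html, "h2"))
print(count_tags(rendered_html, "h2"))
```

With requests, passing requests.get(url).text to count_tags in place of the sample strings performs the same check against the real response. A count of zero means no HTML parser, lxml included, can find the elements in that response.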
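Tip: check which JSON files the page loads. Most likely, some of them provide ready-to-use data for your task. Using F12 in my web browser (and then refreshing the page), I see many JSON files, some of which provide data about the candidates.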
For example, the URL http://data.cnn.com/ELECTION/2016primary/candidates/can1187.json returns the following (abbreviated):
{
"candidateInfo": {
"id": 1187,
"fname": "Mike",
"lname": "Huckabee",
"party": "Rep",
"rd": "1",
"pd": "0",
"td": "1",
"d_nom": 1237,
"inrace": true,
"nominee": false,
"rd_k": "1460",
"td_k": 2472,
"dpct": 0,
"dpct_nom": 50,
"states": [
{
"state": "Alabama",
"code": "AL",
"electiondate": "20160301",
"primarytype": "primary",
"candidates": []
},
{
"state": "Alaska",
"code": "AK",
"electiondate": "20160301",
"primarytype": "caucus",
"candidates": []
},
{
"state": "Arizona",
"code": "AZ",
"electiondate": "",
"primarytype": "",
"candidates": []
},
{
"state": "Arkansas",
"code": "AR",
"electiondate": "20160301",
"primarytype": "primary",
"candidates": []
},
{
"state": "Iowa",
"code": "IA",
"electiondate": "20160201",
"primarytype": "caucus",
"candidates": [
{
"id": 1187,
"rd": "1",
"pd": "0",
"td": "1",
"winner": false
}
]
},
{
"state": "Kansas",
"code": "KS",
"electiondate": "20160305",
"primarytype": "caucus",
"candidates": []
},
{
"state": "Kentucky",
"code": "KY",
"electiondate": "20160305",
"primarytype": "caucus",
"candidates": []
},
{
"state": "Louisiana",
"code": "LA",
"electiondate": "20160305",
"primarytype": "primary",
"candidates": []
},
{
"state": "Maine",
"code": "ME",
"electiondate": "20160305",
"primarytype": "caucus",
"candidates": []
},
{
"state": "Maryland",
"code": "MD",
"electiondate": "",
"primarytype": "",
"candidates": []
},
{
"state": "Massachusetts",
"code": "MA",
"electiondate": "20160301",
"primarytype": "primary",
"candidates": []
},
{
"state": "Michigan",
"code": "MI",
"electiondate": "20160308",
"primarytype": "primary",
"candidates": []
},
{
"state": "Minnesota",
"code": "MN",
"electiondate": "20160301",
"primarytype": "caucus",
"candidates": []
},
{
"state": "Mississippi",
"code": "MS",
"electiondate": "20160308",
"primarytype": "primary",
"candidates": []
},
{
"state": "Missouri",
"code": "MO",
"electiondate": "20160315",
"primarytype": "primary",
"candidates": []
},
{
"state": "Montana",
"code": "MT",
"electiondate": "",
"primarytype": "",
"candidates": []
},
{
"state": "Nebraska",
"code": "NE",
"electiondate": "",
"primarytype": "",
"candidates": []
},
{
"state": "Nevada",
"code": "NV",
"electiondate": "20160223",
"primarytype": "caucus",
"candidates": []
},
{
"state": "New Hampshire",
"code": "NH",
"electiondate": "20160209",
"primarytype": "primary",
"candidates": []
},
{
"state": "New Jersey",
"code": "NJ",
"electiondate": "",
"primarytype": "",
"candidates": []
},
{
"state": "New Mexico",
"code": "NM",
"electiondate": "",
"primarytype": "",
"candidates": []
},
{
"state": "New York",
"code": "NY",
"electiondate": "",
"primarytype": "",
"candidates": []
},
{
"state": "North Carolina",
"code": "NC",
"electiondate": "20160315",
"primarytype": "primary",
"candidates": []
},
{
"state": "North Dakota",
"code": "ND",
"electiondate": "",
"primarytype": "",
"candidates": []
},
{
"state": "Ohio",
"code": "OH",
"electiondate": "20160315",
"primarytype": "primary",
"candidates": []
},
{
"state": "Oklahoma",
"code": "OK",
"electiondate": "20160301",
"primarytype": "primary",
"candidates": []
},
{
"state": "Oregon",
"code": "OR",
"electiondate": "",
"primarytype": "",
"candidates": []
},
{
"state": "Virgin Islands",
"code": "VI",
"electiondate": "",
"primarytype": "",
"candidates": []
},
{
"state": "Northern Marianas",
"code": "MP",
"electiondate": "",
"primarytype": "",
"candidates": []
}
],
"races": [
{
"status": "called",
"code": "AR",
"state": "Arkansas",
"polltype": "exit",
"primarytype": "primary",
"cresults": true,
"cmap": true,
"xpoll": true,
"electiondate": "20160301",
"pctsrep": 100,
"ts": 1457130949809,
"racerank": 6,
"winner": false,
"vpct": 1,
"pctDecimal": "1.2",
"inc": false,
"votes": 4703,
"cvotes": "4,703",
"rd": "0",
"pd": "0",
"sd": "0",
"td": "0",
"position": 13
},
{
"status": "called",
"code": "GA",
"state": "Georgia",
"polltype": "exit",
"primarytype": "primary",
"cresults": true,
"cmap": true,
"xpoll": true,
"electiondate": "20160301",
"pctsrep": 92,
"ts": 1457130978961,
"racerank": 8,
"winner": false,
"vpct": 0,
"pctDecimal": "0.2",
"inc": false,
"votes": 2615,
"cvotes": "2,615",
"rd": "0",
"pd": "0",
"sd": "0",
"td": "0",
"position": 13
},
{
"status": "called",
"code": "TN",
"state": "Tennessee",
"polltype": "exit",
"primarytype": "primary",
"cresults": true,
"cmap": true,
"xpoll": true,
"electiondate": "20160301",
"pctsrep": 100,
"ts": 1457131086792,
"racerank": 7,
"winner": false,
"vpct": 0,
"pctDecimal": "0.3",
"inc": false,
"votes": 2404,
"cvotes": "2,404",
"rd": "0",
"pd": "0",
"sd": "0",
"td": "0",
"position": 15
},
{
"status": "called",
"code": "IA",
"state": "Iowa",
"polltype": "entrance",
"primarytype": "caucus",
"cresults": true,
"cmap": true,
"xpoll": true,
"electiondate": "20160201",
"pctsrep": 99,
"ts": 1454997428611,
"racerank": 9,
"winner": false,
"vpct": 2,
"pctDecimal": "1.8",
"inc": false,
"votes": 3345,
"cvotes": "3,345",
"rd": "1",
"pd": "0",
"sd": "1",
"td": "1",
"position": 14
},
{
"status": "called",
"code": "AL",
"state": "Alabama",
"polltype": "exit",
"primarytype": "primary",
"cresults": true,
"cmap": true,
"xpoll": true,
"electiondate": "20160301",
"pctsrep": 100,
"ts": 1456958822650,
"racerank": 8,
"winner": false,
"vpct": 0,
"pctDecimal": "0.3",
"inc": false,
"votes": 2535,
"cvotes": "2,535",
"rd": "0",
"pd": "0",
"sd": "0",
"td": "0",
"position": 13
}
],
"lts": 1458233488340
}
}
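Once a JSON file like this has been downloaded, turning it into rows for a machine-learning pipeline is mostly dictionary access. The sketch below is one possible way to do it, assuming other candidate files share the structure shown above; the function name races_to_rows and the choice of output columns are illustrative, and sample is a trimmed copy of the response above so that the example runs offline.

```python
import json

# Trimmed copy of the response shown above (two of the five races).
sample = json.loads("""
{
  "candidateInfo": {
    "id": 1187,
    "fname": "Mike",
    "lname": "Huckabee",
    "party": "Rep",
    "races": [
      {"code": "AR", "state": "Arkansas", "primarytype": "primary",
       "electiondate": "20160301", "pctDecimal": "1.2", "votes": 4703},
      {"code": "IA", "state": "Iowa", "primarytype": "caucus",
       "electiondate": "20160201", "pctDecimal": "1.8", "votes": 3345}
    ]
  }
}
""")


def races_to_rows(payload):
    """Flatten candidateInfo.races into one plain dict per state race."""
    info = payload["candidateInfo"]
    name = "{} {}".format(info["fname"], info["lname"])
    rows = []
    for race in info.get("races", []):
        rows.append({
            "candidate": name,
            "party": info["party"],
            "state": race["code"],
            "type": race["primarytype"],
            "date": race["electiondate"],
            "votes": race["votes"],
            # pctDecimal arrives as a string such as "1.2"
            "pct": float(race["pctDecimal"]),
        })
    return rows


for row in races_to_rows(sample):
    print(row["state"], row["votes"], row["pct"])
```

With requests, requests.get(url).json() should give the same kind of dictionary, and the resulting list of dicts can be handed to csv.DictWriter or pandas.DataFrame. Note that this file holds statewide totals for a single candidate; any county-level files would have to be located the same way, through the browser's network tab.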