How to scrape JavaScript text using BeautifulSoup
I am trying to use Python to get text from an HTML page that appears to be generated by an external script:
<td class="headerlast item" rowspan="2" colspan="1" id="Party_ATP021_1">
<div id="Party_ATP021mod"></div>
<script type="text/javascript">
var myTooltip = new YAHOO.widget.Tooltip("Party_ATP021tip", {
context:"Party_ATP021_1", text:"Sozialdemokratische Partei Österreichs
(Social Democratic Party of Austria)", showDelay:1000,
autodismissdelay:5000,iframe:true, preventoverlap:true } );
</script>
SPÖ
</td>
I am trying to get "SPÖ", but without success.
So far, I am able to get the IDs used in the script:
import requests
from bs4 import BeautifulSoup
import re
from fake_useragent import UserAgent
ua = UserAgent()
link = 'http://eed.nsd.uib.no/webview/velocity?v=2&mode=cube&cube=http%3A%2F%2F129.177.90.166%3A80%2Fobj%2FfCube%2FSIEP2004%21Display_C1&study=http%3A%2F%2F129.177.90.166%3A80%2Fobj%2FfStudy%2FSIEP2004%21Display'
headers = {'user-agent': str(ua.random)}
result_page = BeautifulSoup(
    requests.get(link, headers=headers, timeout=10).text, 'html.parser')
# Skip the first matching cell and print the id of each remaining one
for td in result_page.find_all('td', {'class': 'headerlast item'})[1:]:
    print(td.get('id'))
Any help would be appreciated.
Thanks a lot!
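For your example data, you can use select with the CSS selector td.headerlast.item script to find the script element, and then get its next_sibling. In the sample, "SPÖ" is a plain text node that follows the script element inside the td; it is not produced by the script itself, so BeautifulSoup can read it directly.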
html_doc = """
<td class="headerlast item" rowspan="2" colspan="1" id="Party_ATP021_1">
<div id="Party_ATP021mod"></div>
<script type="text/javascript">
var myTooltip = new YAHOO.widget.Tooltip("Party_ATP021tip", {
context:"Party_ATP021_1", text:"Sozialdemokratische Partei Österreichs
(Social Democratic Party of Austria)", showDelay:1000,
autodismissdelay:5000,iframe:true, preventoverlap:true } );
</script>
SPÖ
</td>
"""
from bs4 import BeautifulSoup
result_page = BeautifulSoup(html_doc, 'html.parser')
# The text node right after each script element holds the visible label
for scrpt in result_page.select("td.headerlast.item script"):
    print(scrpt.next_sibling.strip())
This results in:
SPÖ
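
If you also want the cell id and the long party name from the tooltip, the same idea can be extended. The following is only a sketch based on the sample markup above: the names party_labels and TOOLTIP_RE are made up for illustration, and the regular expression assumes every tooltip script contains a text:"..." entry shaped like the one in the sample. It collects the td's direct text nodes instead of relying on next_sibling, which may be a little more tolerant if another element follows the script.

```python
import re
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<td class="headerlast item" rowspan="2" colspan="1" id="Party_ATP021_1">
<div id="Party_ATP021mod"></div>
<script type="text/javascript">
var myTooltip = new YAHOO.widget.Tooltip("Party_ATP021tip", {
context:"Party_ATP021_1", text:"Sozialdemokratische Partei Österreichs
(Social Democratic Party of Austria)", showDelay:1000,
autodismissdelay:5000,iframe:true, preventoverlap:true } );
</script>
SPÖ
</td>
"""

# Hypothetical pattern: \b keeps it from matching the "text:" inside "context:".
# DOTALL lets the quoted value span the line break seen in the sample.
TOOLTIP_RE = re.compile(r'\btext:"(.*?)"', re.DOTALL)


def party_labels(soup):
    """Map each matching cell id to (visible label, tooltip text or None)."""
    rows = {}
    for td in soup.select("td.headerlast.item"):
        # Only strings that are direct children of the td; the JavaScript
        # source is a child of the script tag, so it is not included here.
        direct = [c for c in td.children if isinstance(c, NavigableString)]
        label = " ".join(s.strip() for s in direct if s.strip())

        script = td.find("script")
        match = TOOLTIP_RE.search(script.string or "") if script else None
        # Collapse the embedded newline and any extra spaces into single spaces
        tooltip = " ".join(match.group(1).split()) if match else None

        rows[td.get("id")] = (label, tooltip)
    return rows


soup = BeautifulSoup(html_doc, "html.parser")
for cell_id, (label, tooltip) in party_labels(soup).items():
    print(cell_id, label, tooltip, sep=" | ")
```

For the sample cell this should give the id Party_ATP021_1, the label SPÖ, and the full party name with the line break collapsed. On the live page you would pass the parsed response to party_labels instead of html_doc; whether every cell there follows the same layout is something to check against the actual markup.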