Is there a way to scrape HTML popup tables/charts with Python?
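I am currently working on an MMA machine learning project that uses https://www.bestfightodds.com/. I am specifically looking for each fighter's DraftKings opening odds, which are found by clicking a given fighter's odds under the DraftKings column. You are then shown a popup table of how the betting odds have changed over time. The table gives you the opening odds and the latest (current) odds.
I have no problem scraping the fighter names, but I can't figure out how to scrape the opening odds from the popup table. The HTML for the popup table only appears in the inspector after you click it, which is why I get 'None' when I try to find it in the site's HTML.
Here is my code so far: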
# Importing packages
from bs4 import BeautifulSoup
import requests
# Specifying website URL
html_text = requests.get('https://www.bestfightodds.com/events/ufc-273-2411').text
soup = BeautifulSoup(html_text, 'lxml')
# Finding values
fighter_names = soup.find_all('span', class_='t-b-fcc')
# Fixed the unbalanced quote; the keyword for the attribute is style, not style_
opening_odds = soup.find_all('span', style='margin-left: 4px; margin-right: 4px;')
for fighter in fighter_names:
    print(fighter.get_text())
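Here is a photo showing where and how the opening odds are found. The blue box is what you click to reach the red one, which is the value I want to scrape for every fighter.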
The pop-ups are triggered by JavaScript, so your scraper needs to be able to inject JavaScript into the website. I know apify.com uses what is called headless Chrome/Chromium automation. You can check out this Python library on GitHub: Headless Chrome/Chromium automation library (unofficial port of puppeteer).
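Interesting little project. The data the server sends is encoded with a custom JavaScript function, so you need to either use selenium or rewrite the decoding function in Python.
I use js2py to execute the JavaScript functions directly in Python instead of using selenium (it translates the JavaScript functions to Python automatically), but you can rewrite them in Python by hand if you prefer: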
import json
import js2py
import requests
from bs4 import BeautifulSoup
js_decode_func = r"""function $(e) {
var t,
a,
r,
s,
o,
i,
l = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=',
n = '',
d = 0;
for (e = e.replace(/[^A-Za-z0-9\+\/\=]/g, ''); d < e.length;) t = l.indexOf(e.charAt(d++)) << 2 | (s = l.indexOf(e.charAt(d++))) >> 4,
a = (15 & s) << 4 | (o = l.indexOf(e.charAt(d++))) >> 2,
r = (3 & o) << 6 | (i = l.indexOf(e.charAt(d++))),
n += String.fromCharCode(t),
64 != o && (n += String.fromCharCode(a)),
64 != i && (n += String.fromCharCode(r));
for (var c = '', h = 0, p = c1 = c2 = 0; h < n.length;)(p = n.charCodeAt(h)) < 128 ? (c += String.fromCharCode(p), h++) : 191 < p && p < 224 ? (c2 = n.charCodeAt(h + 1), c += String.fromCharCode((31 & p) << 6 | 63 & c2), h += 2) : (c2 = n.charCodeAt(h + 1), c3 = n.charCodeAt(h + 2), c += String.fromCharCode((15 & p) << 12 | (63 & c2) << 6 | 63 & c3), h += 3);
var u,
f,
m,
g = '!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~',
y = new String,
$ = g.length;
for (u = 0; u < c.length; u++) m = c.charAt(u),
0 <= (f = g.indexOf(m)) && (m = g.charAt((f + $ / 2) % $)),
y += m;
return y
}"""
js_get_value_func = r"""function $(e) {
return 2 <= e ? '+' + Math.round(100 * (e - 1)) : e < 2 ? '' + Math.round( - 100 / (e - 1)) : 'error'
}"""
decode = js2py.eval_js(js_decode_func)
get_value = js2py.eval_js(js_get_value_func)
url = "https://www.bestfightodds.com/"
api_url = "https://www.bestfightodds.com/api/ggd"
params = {"b": "22", "m": "25728", "p": "1"}
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for td in soup.select("td[data-li]"):
    vals = json.loads(td["data-li"])
    if len(vals) != 3 or vals[0] != 22:  # 22 - DraftKings
        continue
    params["b"], params["p"], params["m"] = vals
    name = td.find_previous(class_="t-b-fcc").text
    encoded_text = requests.get(api_url, params=params).text
    data = json.loads(decode(encoded_text))
    first_value = get_value(data[0]["data"][0]["y"])
    print(name, first_value)
Prints:
Alexander Volkanovski -450
Chan Sung Jung +340
Aljamain Sterling +320
Petr Yan -425
Mackenzie Dern +120
Tecia Torres -140
Mark Madsen +130
Vinc Pichel -150
Darian Weeks +190
Ian Garry -235
Mickey Gall +145
Mike Malott. -165
Aspen Ladd +155
Raquel Pennington -180
Anthony Hernandez -180
Josh Fremd +155
Aleksei Oleinik -105
Jared Vanderaa -115
Kay Hansen -150
Piera Rodriguez +130
Daniel Santos +175
Julio Arce -210
Belal Muhammad +150
Vicente Luque -170
Devin Clark -160
William Knight +140
Jordan Leavitt +110
Trey Ogden. -130
Elizeu Zaleski Dos Santos -195
Mounir Lazzez +165
Pat Sabatini -305
T.J. Laramie +240
Mayra Bueno Silva -365
Yanan Wu +280
Lina Akhtar Lansberg +245
Pannie Kianzad -310
Chris Barnett +165
Martin Buday -195
Andre Fialho +150
Miguel Baeza -170
Brandon Jenkins +320
Drakkar Klose -425
Jesse Ronson +110
Rafa Garcia -130
Caio Borralho +115
Gadzhi Omargadzhiev -135
Istela Nunes -190
Sam Hughes +160
Heili Alateng -180
Kevin Croom +155
Carla Esparza +150
Rose Namajunas -170
Glover Teixeira +155
Jiri Prochazka -180
Dustin Poirier -435
Nate Diaz +330
Charles Oliveira -160
Justin Gaethje +140
Gilbert Burns +280
Khamzat Chimaev -365
Arman Tsarukyan -335
Joel Alvarez +260
Calvin Cattar +170
Giga Chikadze -200
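For anyone who wants to drop the js2py dependency, here is a rough sketch of what a hand-written Python port of the two JavaScript functions could look like. It mirrors the JavaScript exactly as quoted in the answer and has not been checked against the live site. Two readings are assumptions on my part: that `\]` inside the JavaScript string literal evaluates to a plain `]` (so the rotation alphabet has 93 characters, without the backslash), and that `charAt` truncates the fractional index produced by `$ / 2`. The names `ALPHABET` and `js_round` are made up for this example.

```python
import base64
import math
import re

# Printable ASCII '!'..'~' minus the backslash. Assumption: this is what the
# quoted JavaScript literal evaluates to, since `\]` in a JS string is just `]`.
ALPHABET = "".join(chr(c) for c in range(33, 127) if chr(c) != "\\")


def js_round(x):
    """Round half up, which is how JavaScript's Math.round behaves."""
    return math.floor(x + 0.5)


def decode(encoded):
    """Base64 decode, UTF-8 decode, then rotate characters within ALPHABET."""
    # The JavaScript drops every character outside the Base64 alphabet first.
    cleaned = re.sub(r"[^A-Za-z0-9+/=]", "", encoded)
    text = base64.b64decode(cleaned).decode("utf-8")
    size = len(ALPHABET)
    # The JavaScript adds size / 2 (46.5) and charAt() truncates the index,
    # which is the same as adding the integer half.
    shift = size // 2
    out = []
    for ch in text:
        pos = ALPHABET.find(ch)
        out.append(ALPHABET[(pos + shift) % size] if pos >= 0 else ch)
    return "".join(out)


def get_value(decimal_odds):
    """Convert decimal odds to an American moneyline string."""
    if decimal_odds >= 2:
        return "+" + str(js_round(100 * (decimal_odds - 1)))
    if decimal_odds < 2:
        # Note: decimal_odds == 1 raises ZeroDivisionError here, whereas the
        # JavaScript version would return "-Infinity".
        return str(js_round(-100 / (decimal_odds - 1)))
    return "error"  # only reached when decimal_odds is NaN
```

If the port behaves like the original, these two functions should be usable in place of the `js2py.eval_js(...)` results in the script above, with the rest of the loop unchanged. The explicit `js_round` helper is there because Python's built-in `round` rounds halves to the nearest even number, which differs from `Math.round`.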