BeautifulSoup Not Returning Twitch.tv Viewcount
I am trying to get the viewer counts shown on www.twitch.tv/directory using Python. I tried a basic BeautifulSoup script:
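from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.twitch.tv/directory'
html = urlopen(url)
soup = BeautifulSoup(html, "html5lib")  # parse the response, not the URL string; also tried html.parser, lxml
soup.prettify()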
This gives me HTML that does not contain the actual viewer counts.
Then I tried sending AJAX-style parameters, based on this thread:
import requests

param = {"action": "getcategory",
         "br": "f21",
         "category": "dress",
         "pageno": "",
         "pagesize": "",
         "sort": "",
         "fsize": "",
         "fcolor": "",
         "fprice": "",
         "fattr": ""}
url = "https://www.twitch.tv/directory"
# Also tried with the headers parameter headers={"User-Agent":"Mozilla/5.0...
js = requests.get(url, params=param).json()
But I get a JSONDecodeError: Expecting value: line 1 column 1 (char 0) error.
After that I moved on to Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Edge()
url = 'https://www.twitch.tv/directory'
driver.get(url)
# Also tried driver.execute_script("return document.documentElement.outerHTML") and innerHTML
html = driver.page_source
driver.close()
soup = BeautifulSoup(html, "lxml")
These just produce the same result as the plain BeautifulSoup call.
Any help with scraping the view counts would be appreciated.
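The stats are not in the page when it first loads. The page makes a GraphQL request to https://gql.twitch.tv/gql to fetch the game data. When the user is not logged in, that GraphQL request asks for the query AnonFrontPage_TopChannels.
Here is a working request in Python: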
import requests
import json

resp = requests.post(
    "https://gql.twitch.tv/gql",
    json.dumps(
        {
            "operationName": "AnonFrontPage_TopChannels",
            "variables": {"platformType": "all", "isTagsExperiment": True},
            "extensions": {
                "persistedQuery": {
                    "version": 1,
                    "sha256Hash": "d94b2fd8ad1d2c2ea82c187d65ebf3810144b4436fbf2a1dc3af0983d9bd69e9",
                }
            },
        }
    ),
    headers={'Client-Id': 'kimne78kx3ncx6brgo4mv6wki5h1ko'},
)
print(json.loads(resp.content))
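I've included a Client-Id in the request. The ID does not seem to be unique to a session, but I imagine Twitch expires them, so this probably won't work forever. You will have to inspect future GraphQL requests and grab a new Client-Id, or figure out how to scrape one programmatically from the page.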
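As a rough sketch of the "scrape one programmatically" idea, the snippet below pulls a Client-Id-like value out of page source with a regular expression. The pattern, the function name `extract_client_id`, and the sample markup are all assumptions made for illustration; I have not confirmed how the ID is actually embedded in the page, so check the real source and adjust the pattern before relying on it.

```python
import re
from typing import Optional

# Hypothetical pattern: assumes the ID appears in the page source as
# clientId="<alphanumeric string>" or clientId: "<alphanumeric string>".
# Verify this against the real markup before use.
CLIENT_ID_PATTERN = re.compile(r'clientId\s*[:=]\s*"([a-z0-9]+)"', re.IGNORECASE)


def extract_client_id(page_source: str) -> Optional[str]:
    """Return the first Client-Id-like value found in page_source, or None."""
    match = CLIENT_ID_PATTERN.search(page_source)
    return match.group(1) if match else None


# Made-up snippet for illustration only, not real Twitch markup.
sample = '<script>window.config = {clientId="kimne78kx3ncx6brgo4mv6wki5h1ko"};</script>'
print(extract_client_id(sample))
```

Returning None when nothing matches lets the caller fall back to a manually copied ID instead of sending a request with an empty header.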
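The AnonFrontPage_TopChannels request actually appears to feed the Top Live Channels section. Here is how to get the view counts and titles: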
edges = json.loads(resp.content)["data"]["streams"]["edges"]
games = [(f["node"]["title"], f["node"]["viewersCount"]) for f in edges]
# games:
[
    ("Let us GAME", 78250),
    ("(REBROADCAST) Worlds Play-In Knockouts: Cloud9 vs. Gambit Esports", 36783),
    ("RuneFest 2018 - OSRS Reveals !schedule", 35042),
    (None, 25237),
    ("Front Page of TWITCH + Fortnite FALL SKIRMISH Training!", 22380),
    ("Reckful - 3v3 with barry and a german", 20399),
]
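The chained subscripts above raise a KeyError or TypeError as soon as one level is missing or null, and the sample output shows that a title can be None. A more defensive version might look like the sketch below. The helper name `top_streams` and the sample payload are mine; the payload is hand-written to mirror the response shape shown above rather than taken from a live request.

```python
from typing import Any, Dict, List, Optional, Tuple


def top_streams(payload: Dict[str, Any]) -> List[Tuple[Optional[str], int]]:
    """Return (title, viewersCount) pairs, highest viewer count first."""
    # Walk data -> streams -> edges, tolerating missing or null levels.
    edges = ((payload.get("data") or {}).get("streams") or {}).get("edges") or []
    pairs = []
    for edge in edges:
        node = edge.get("node") or {}
        # Streams without a viewer count are skipped; a None title is kept.
        if node.get("viewersCount") is not None:
            pairs.append((node.get("title"), node["viewersCount"]))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hand-written sample shaped like the response above, not live data.
sample = {
    "data": {
        "streams": {
            "edges": [
                {"node": {"title": None, "viewersCount": 25237}},
                {"node": {"title": "Let us GAME", "viewersCount": 78250}},
            ]
        }
    }
}
print(top_streams(sample))
print(top_streams({"errors": [{"message": "example error"}]}))
```

With a real response you would call `top_streams(resp.json())`. An empty list then means the response did not have the expected shape, which is easier to handle than an exception raised in the middle of a comprehension.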
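You will need to go through the Chrome network inspector and work out the structure of the other requests to get more data.
Here is an example for the directory page: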
import requests
import json

resp = requests.post(
    "https://gql.twitch.tv/gql",
    json.dumps(
        {
            "operationName": "BrowsePage_AllDirectories",
            "variables": {
                "limit": 30,
                "directoryFilters": ["GAMES"],
                "isTagsExperiment": True,
                "tags": [],
            },
            "extensions": {
                "persistedQuery": {
                    "version": 1,
                    "sha256Hash": "75fb8eaa6e61d995a4d679dcb78b0d5e485778d1384a6232cba301418923d6b7",
                }
            },
        }
    ),
    headers={"Client-Id": "kimne78kx3ncx6brgo4mv6wki5h1ko"},
)
edges = json.loads(resp.content)["data"]["directoriesWithTags"]["edges"]
games = [f["node"] for f in edges]
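Both requests use the same envelope: an operation name, a variables object, and a persisted-query hash. If you end up copying more operations out of the network inspector, a small builder along these lines may cut down on repetition. The function name `build_persisted_query` is hypothetical, and the sketch only assembles the JSON body; it makes no network call.

```python
import json
from typing import Any, Dict


def build_persisted_query(operation_name: str, variables: Dict[str, Any], sha256_hash: str) -> str:
    """Serialize a request body in the shape used by the two examples above."""
    body = {
        "operationName": operation_name,
        "variables": variables,
        "extensions": {"persistedQuery": {"version": 1, "sha256Hash": sha256_hash}},
    }
    return json.dumps(body)


# Values copied from the directory-page example above.
body = build_persisted_query(
    "BrowsePage_AllDirectories",
    {"limit": 30, "directoryFilters": ["GAMES"], "isTagsExperiment": True, "tags": []},
    "75fb8eaa6e61d995a4d679dcb78b0d5e485778d1384a6232cba301418923d6b7",
)
print(body)
```

The resulting string can be passed as the second argument to `requests.post`, exactly where the `json.dumps(...)` call sits in the examples above.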