How to parse the transcript from TED Talks

Question

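I can't parse the transcript of the video at https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript

The request does not return the span class where the text actually lives. What could be the problem?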

import requests

url = 'https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript'
page = requests.get(url)
print(page.content)

Is there any way to get the transcript? Thank you. I need to reach this text, but I get "no attribute found".

Answer 1

That is because the data is not loaded through the link you are using, but through a call to their GraphQL instance.

With curl, you can fetch the data like this:

curl 'https://www.ted.com/graphql?operationName=Transcript&variables=%7B%22id%22%3A%22alexis_nikole_nelson_a_flavorful_field_guide_to_foraging%22%2C%22language%22%3A%22en%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%2218f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6%22%7D%7D' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Referer: https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript' -H 'content-type: application/json' -H 'client-id: Zenith production' -H 'x-operation-name: Transcript' --output - | gzip -d

Note that the URL is urlencoded. In Python, you can use from urllib.parse import quote and call quote() to urlencode a string.
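As a minimal sketch of that encoding step, the snippet below rebuilds the query string from plain Python dicts instead of hand-encoding it. The helper names `encode_json_param` and `build_transcript_url` are made up for illustration, and the persisted-query hash is simply the one copied from the curl command above, so it may stop working if TED changes the query.

```python
import json
from urllib.parse import quote

GRAPHQL_ENDPOINT = "https://www.ted.com/graphql"
# Hash copied from the curl command above (assumption: it is still valid).
TRANSCRIPT_QUERY_HASH = "18f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6"


def encode_json_param(value):
    """Serialize value as compact JSON, then percent-encode it."""
    compact = json.dumps(value, separators=(",", ":"))
    # safe="" makes quote() also encode "/" which it would otherwise leave alone.
    return quote(compact, safe="")


def build_transcript_url(talk_id, language="en"):
    """Hypothetical helper: build the GraphQL URL for one talk slug."""
    variables = {"id": talk_id, "language": language}
    extensions = {"persistedQuery": {"version": 1, "sha256Hash": TRANSCRIPT_QUERY_HASH}}
    return (
        f"{GRAPHQL_ENDPOINT}?operationName=Transcript"
        f"&variables={encode_json_param(variables)}"
        f"&extensions={encode_json_param(extensions)}"
    )


url = build_transcript_url("alexis_nikole_nelson_a_flavorful_field_guide_to_foraging")
print(url)
```

Building the URL this way makes it easier to swap in another talk slug or language code without touching the percent-encoded string by hand.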

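So simply translate the curl command above into Python. There is no magic; you only need to set the right headers. If you're lazy, you can also use this online converter to turn the curl command into Python code.

This produces: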

import requests
from requests.structures import CaseInsensitiveDict

url = "https://www.ted.com/graphql?operationName=Transcript&variables=%7B%22id%22%3A%22alexis_nikole_nelson_a_flavorful_field_guide_to_foraging%22%2C%22language%22%3A%22en%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%2218f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6%22%7D%7D"

headers = CaseInsensitiveDict()
headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
headers["Accept"] = "*/*"
headers["Accept-Language"] = "en-US,en;q=0.5"
headers["Accept-Encoding"] = "gzip, deflate"  # "br" omitted: requests decodes Brotli only if the brotli package is installed
headers["Referer"] = "https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript"
headers["content-type"] = "application/json"
headers["client-id"] = "Zenith production"
headers["x-operation-name"] = "Transcript"

resp = requests.get(url, headers=headers)
print(resp.content)

Output:

b'{"data":{"translation":{"id":"209255","language" ...
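The response is JSON, so the remaining step is to decode it and pull out the text fields. The output above is truncated, so the exact field names below `data.translation` are not shown here. The sketch below therefore uses a generic walker, `collect_values` (a hypothetical helper), and a made-up sample payload whose `paragraphs`, `cues`, and `text` keys are assumptions. Inspect the real response (for example with `resp.json()`) and adjust the key name accordingly.

```python
import json


def collect_values(node, key):
    """Recursively gather every value stored under `key` in nested dicts and lists."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            else:
                found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Made-up payload for illustration only: everything after "language" is an
# assumed shape, not the actual TED response.
sample = b'{"data":{"translation":{"id":"209255","language":"en","paragraphs":[{"cues":[{"text":"Hello"},{"text":"world"}]}]}}}'

payload = json.loads(sample)  # with requests, resp.json() does the same decoding
transcript = " ".join(collect_values(payload, "text"))
print(transcript)
```

Because the walker searches by key name rather than by a fixed path, it keeps working if the nesting around the text fields differs from the sample.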


python

parsing

python-requests