Parsing HTML for a specific script type
In Vine's HTML source, the <script type="application/ld+json"> element contains links to all the videos on the page. How can I access this JSON?
import requests
from bs4 import BeautifulSoup
url = 'https://vine.co/tags/funny'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
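You can use a CSS selector. The attribute value is quoted because it contains / and +, which are not valid in an unquoted CSS attribute value:
soup.select('script[type="application/ld+json"]')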
Or use find_all with type="application/ld+json":
soup.find_all("script",type="application/ld+json")
Both give you:
[<script type="application/ld+json">\n {\n "@context": "http://schema.org",\n "@type": "ItemList",\n "url": "https://vine.co/tags/funny",\n "itemListElement": [\n \n {\n "@type": "ListItem",\n "position": 1,\n "url": "https://vine.co/v/iLKgAXeqwqu"\n },\n \n {\n "@type": "ListItem",\n "position": 2,\n "url": "https://vine.co/v/iLK6p2UHDTl"\n },\n \n {\n "@type": "ListItem",\n "position": 3,\n "url": "https://vine.co/v/iLKrbIeXPTH"\n },\n \n {\n "@type": "ListItem",\n "position": 4,\n "url": "https://vine.co/v/iLKrbZ5zir0"\n },\n \n {\n "@type": "ListItem",\n "position": 5,\n "url": "https://vine.co/v/iLKvxUwLUxr"\n },\n \n {\n "@type": "ListItem",\n "position": 6,\n "url": "https://vine.co/v/iLKvnVOd7VA"\n },\n \n {\n "@type": "ListItem",\n "position": 7,\n "url": "https://vine.co/v/iLKv73UQmjB"\n },\n \n {\n "@type": "ListItem",\n "position": 8,\n "url": "https://vine.co/v/iLKvBeO9Fmt"\n },\n \n {\n "@type": "ListItem",\n "position": 9,\n "url": "https://vine.co/v/iLKnrqMDYeD"\n },\n \n {\n "@type": "ListItem",\n "position": 10,\n "url": "https://vine.co/v/iLKnWrjMqwE"\n },\n \n {\n "@type": "ListItem",\n "position": 11,\n "url": "https://vine.co/v/iLK17Bg1wt0"\n },\n \n {\n "@type": "ListItem",\n "position": 12,\n "url": "https://vine.co/v/iLK5ExAZ7WB"\n },\n \n {\n "@type": "ListItem",\n "position": 13,\n "url": "https://vine.co/v/iLK5Eg7vHM7"\n },\n \n {\n "@type": "ListItem",\n "position": 14,\n "url": "https://vine.co/v/iLKitbix3pb"\n },\n \n {\n "@type": "ListItem",\n "position": 15,\n "url": "https://vine.co/v/iLKOleYJhUp"\n },\n \n {\n "@type": "ListItem",\n "position": 16,\n "url": "https://vine.co/v/iLKOTFgXVFQ"\n },\n \n {\n "@type": "ListItem",\n "position": 17,\n "url": "https://vine.co/v/iLKMI6t91xe"\n },\n \n {\n "@type": "ListItem",\n "position": 18,\n "url": "https://vine.co/v/iLKMX6p0TD6"\n },\n \n {\n "@type": "ListItem",\n "position": 19,\n "url": "https://vine.co/v/iLKM6Hh1nzr"\n },\n \n {\n "@type": "ListItem",\n "position": 20,\n "url": 
"https://vine.co/v/iLKhQWVIAj3"\n }\n \n ]\n }\n </script>]
To turn it into JSON, you just need to call json.loads on the text. Since there is only one such tag, you can use select_one or find:
import requests
from bs4 import BeautifulSoup
import json
url = 'https://vine.co/tags/funny'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
# js = json.loads(soup.find("script", type="application/ld+json").text)
js = json.loads(soup.select_one('script[type="application/ld+json"]').text)
print(js)
This gives you:
{u'url': u'https://vine.co/tags/funny', u'@context': u'http://schema.org', u'itemListElement': [{u'url': u'https://vine.co/v/iLKgAXeqwqu', u'position': 1, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK6p2UHDTl', u'position': 2, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKrbIeXPTH', u'position': 3, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKrbZ5zir0', u'position': 4, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKvxUwLUxr', u'position': 5, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKvnVOd7VA', u'position': 6, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKv73UQmjB', u'position': 7, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKvBeO9Fmt', u'position': 8, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKnrqMDYeD', u'position': 9, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKnWrjMqwE', u'position': 10, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK17Bg1wt0', u'position': 11, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK5ExAZ7WB', u'position': 12, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK5Eg7vHM7', u'position': 13, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKitbix3pb', u'position': 14, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKOleYJhUp', u'position': 15, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKOTFgXVFQ', u'position': 16, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKMI6t91xe', u'position': 17, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKMX6p0TD6', u'position': 18, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKM6Hh1nzr', u'position': 19, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKhQWVIAj3', u'position': 20, u'@type': u'ListItem'}], u'@type': u'ItemList'}
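If you would rather not depend on BeautifulSoup, or the page might contain zero or several JSON-LD blocks, the same idea can be sketched with the standard library's html.parser. This is only a rough alternative, not the approach used above: JsonLdCollector and parse_json_ld are names made up for illustration, and sample_html is a shortened, hand-written copy of the markup shown earlier.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._chunks))
            self._in_json_ld = False


def parse_json_ld(html):
    """Return a list with one parsed object per valid JSON-LD block (possibly empty)."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    parsed = []
    for raw in collector.blocks:
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON instead of failing.
            continue
    return parsed


# Shortened sample markup (assumption: same structure as the real page).
sample_html = """
<html><head>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "ItemList",
  "url": "https://vine.co/tags/funny",
  "itemListElement": [
    {"@type": "ListItem", "position": 1, "url": "https://vine.co/v/iLKgAXeqwqu"},
    {"@type": "ListItem", "position": 2, "url": "https://vine.co/v/iLK6p2UHDTl"}
  ]
}
</script>
</head><body></body></html>
"""

blocks = parse_json_ld(sample_html)
print(blocks[0]["url"])
```

Because parse_json_ld returns a list, a page without the tag gives an empty list rather than an AttributeError on None.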
The last step is just to pull the URLs out of js. They are in a list of dicts that you can access with js["itemListElement"]:
In [18]: js = json.loads(soup.select_one('script[type="application/ld+json"]').text)
In [19]: all_urls = [dct["url"] for dct in js["itemListElement"]]
In [20]: print(all_urls)
['https://vine.co/v/iLK2rbzBU50', 'https://vine.co/v/iLK2iw305nH', 'https://vine.co/v/iLK2AadMMTO', 'https://vine.co/v/iLK2WY1EMWJ', 'https://vine.co/v/iLKQ6AdTtXE', 'https://vine.co/v/iLKQAPtKdwF', 'https://vine.co/v/iLKQAKpVJAM', 'https://vine.co/v/iLKxQqIH65I', 'https://vine.co/v/iLKxAuJwe2v', 'https://vine.co/v/iLKPQhZprq3', 'https://vine.co/v/iLKPIij7EzW', 'https://vine.co/v/iLKU697X3iQ', 'https://vine.co/v/iLKFZDTUHla', 'https://vine.co/v/iLKtPzahtel', 'https://vine.co/v/iLKTbpb1hgO', 'https://vine.co/v/iLKTaKYEx06', 'https://vine.co/v/iLKInbjuAnY', 'https://vine.co/v/iLKIBDbbDHY', 'https://vine.co/v/iLKjPxPz7bK', 'https://vine.co/v/iLKjFzKJwYF']
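If you want the extraction to be a little more defensive, something like the following may help. It is a minimal sketch, assuming the ItemList structure shown above: extract_urls is a hypothetical helper name, and the sample data is hand-written, including an entry without a "url" key that is there only to show the filtering.

```python
import json


def extract_urls(item_list):
    """Return the ListItem URLs from a parsed ItemList, ordered by position."""
    items = item_list.get("itemListElement", [])
    # Keep only entries that actually carry a URL.
    with_urls = [item for item in items if "url" in item]
    # Order by the "position" field; entries without one sort first.
    with_urls.sort(key=lambda item: item.get("position", 0))
    return [item["url"] for item in with_urls]


# Hand-written sample in the same shape as the JSON-LD block (assumption).
raw = """
{
  "@context": "http://schema.org",
  "@type": "ItemList",
  "url": "https://vine.co/tags/funny",
  "itemListElement": [
    {"@type": "ListItem", "position": 2, "url": "https://vine.co/v/iLK6p2UHDTl"},
    {"@type": "ListItem", "position": 1, "url": "https://vine.co/v/iLKgAXeqwqu"},
    {"@type": "ListItem", "position": 3}
  ]
}
"""

js = json.loads(raw)
all_urls = extract_urls(js)
print(all_urls)
```

Using .get with a default means a block without "itemListElement" yields an empty list instead of raising KeyError.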