Why is this website not scrape-able with bs4?
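I'm a beginner learning web scraping, and for some reason I can't scrape this website. When I inspect it in Chrome the code looks fine, but when I read it with BeautifulSoup it is no longer scrapable. The soup mentions 'Google Analytics', and I really don't know what that is.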
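The site's content is loaded via JavaScript, so it is not in the HTML that BeautifulSoup receives. You can still use the requests module to fetch the individual chapters. Chapter URLs have the form https://detroitbecometext.github.io/assets/html/chapterXY.html (example).
For example, this script: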
import re
import requests
from bs4 import BeautifulSoup

url = 'https://detroitbecometext.github.io/chapters'
asset_url = 'https://detroitbecometext.github.io/assets/html/'

# The content is loaded by JavaScript, so locate the main.*.js bundle the page references.
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
main_js = requests.get('https://detroitbecometext.github.io/' + soup.select_one('script[src^="main."]')['src']).text

# The bundle refers to the chapter files by name; fetch each one and print its text.
for ch in re.findall(r'(chapter[\d.]+\.html?)', main_js):
    soup = BeautifulSoup(requests.get(asset_url + ch).content, 'html.parser')
    print(soup.get_text())
    print('-' * 80)
Prints all the text of every chapter:
...
Out of the elevator
SWAT: Negotiator on site. Repeat, negotiator on site.
Caroline Phillips: No, stop... I... I... I can't leave her. Oh, oh please, please, you gotta save my little girl... Wait... you're
sending an android?
SWAT: Alright, ma'am. We need to go.
Caroline Phillips: You can't...you can't do that! You W- Why aren't you sending a real
person? Don't let that thing near her! Keep that thing away from my daughter! KEEP IT AWAY!
...
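One possible refinement, not part of the original answer: if the JavaScript bundle happens to mention the same chapter file more than once, the loop above would download it repeatedly. The following is a minimal sketch that collects each chapter URL only once, in order of first appearance. The helper name `chapter_urls` and the `sample_js` string are hypothetical and stand in for the real bundle text; the regular expression and the asset URL are the ones from the script above.

```python
import re
from urllib.parse import urljoin

ASSET_URL = 'https://detroitbecometext.github.io/assets/html/'

# Same pattern as in the script above.
CHAPTER_RE = re.compile(r'(chapter[\d.]+\.html?)')


def chapter_urls(js_text, base=ASSET_URL):
    """Return unique chapter URLs, in order of first appearance in js_text."""
    # dict.fromkeys preserves insertion order and drops repeated names.
    names = dict.fromkeys(CHAPTER_RE.findall(js_text))
    return [urljoin(base, name) for name in names]


# Made-up stand-in for the text of main.*.js (assumption, not the real bundle).
sample_js = 'a("chapter01.html");b("chapter02.html");c("chapter01.html")'

for chapter_url in chapter_urls(sample_js):
    print(chapter_url)
```

In the script above, `chapter_urls(main_js)` could replace the `re.findall(...)` call, with each returned URL passed straight to `requests.get`.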