Why is this website not scrape-able with bs4?

Question

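I'm a beginner learning web scraping, and for some reason I can't scrape this website. When I inspect it in Chrome the markup looks fine, but when I read it with BeautifulSoup it is no longer scrape-able. The soup mentions 'Google Analytics', and I really don't know what that is.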

Answer 1

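The site's content is loaded through JavaScript, so it is not present in the HTML that requests downloads, but you can fetch the individual chapters with the requests module. Chapter URLs have the form https://detroitbecometext.github.io/assets/html/chapterXY.html (example).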
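
One way to check whether a page is filled in by JavaScript is to look at how much visible text the raw HTML carries once scripts are ignored. The sketch below uses only the standard library; the `shell` string is a made-up stand-in for such a page (it is not the real site's markup), and `visible_text` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not script or style content.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a reader would see without running any JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


# Hypothetical shell page, invented for illustration: scripts only, no content.
shell = """<html><head>
<script>/* Google Analytics snippet */</script></head>
<body><app-root></app-root><script src="main.abc123.js"></script></body></html>"""

print(repr(visible_text(shell)))                            # nothing visible
print(repr(visible_text('<p>Out of the elevator</p>')))     # ordinary static HTML
```

If the downloaded HTML yields little or no visible text while the browser shows a full page, the content is most likely injected by a script, which would also explain why an analytics snippet is among the few things left in the soup.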

For example, this script:

import re
import requests
from bs4 import BeautifulSoup


url = 'https://detroitbecometext.github.io/chapters'
asset_url = 'https://detroitbecometext.github.io/assets/html/'

# The page itself is only a shell; find the main.* JavaScript bundle it loads and download it.
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
main_js = requests.get('https://detroitbecometext.github.io/' + soup.select_one('script[src^="main."]')['src']).text

# Pull the chapter file names out of the bundle, then fetch each file and print its text.
for ch in re.findall(r'(chapter[\d.]+\.html?)', main_js):
    soup = BeautifulSoup(requests.get(asset_url + ch).content, 'html.parser')
    print(soup.get_text())
    print('-' * 80)

Prints all the text of all chapters:

...


Out of the elevator

SWAT: Negotiator on site. Repeat, negotiator on site.
Caroline Phillips: No, stop... I... I... I can't leave her. Oh, oh please, please, you gotta save my little girl... Wait... you're
        sending an android?
SWAT: Alright, ma'am. We need to go.
Caroline Phillips: You can't...you can't do that! You W- Why aren't you sending a real
        person? Don't let that thing near her! Keep that thing away from my daughter! KEEP IT AWAY!
    

...
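
To make the regular-expression step easier to follow, here is a small offline sketch of what the pattern picks out of a bundle. The `sample` string is invented for illustration and the real bundle will look different; `chapter_files` is a hypothetical helper that also drops repeated names, which the script above does not do.

```python
import re

# Same pattern as in the script: "chapter", then digits and dots, then ".htm" or ".html".
CHAPTER_RE = re.compile(r'(chapter[\d.]+\.html?)')


def chapter_files(bundle_text):
    """Return the chapter file names found in bundle_text, in order, without repeats."""
    return list(dict.fromkeys(CHAPTER_RE.findall(bundle_text)))


# Hypothetical bundle fragment (an assumption, not the site's actual JavaScript).
sample = (
    'load("assets/html/chapter1.html");'
    'load("assets/html/chapter2.1.html");'
    'go("/chapters");'
    'load("assets/html/chapter1.html")'
)

print(chapter_files(sample))
```

The route name "/chapters" is skipped because the pattern requires at least one digit or dot right after "chapter".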
