Retrieving URL out of a file for scraping

Question

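I made a scraper, and I would like the "page_link = """ assignment to go through every URL saved in a JSON, XML, or SQL file.

Could someone point me in the right direction so I can learn how to make this dynamic instead of static?

You don't have to give me the answer; just point me to where I can learn more about what I should do. I'm still learning.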

from bs4 import BeautifulSoup
import requests
print('step 1')
# get url
page_link = "<random website with info>"
print('step 2')
# open page
page_response = requests.get(page_link, timeout=1)
print('step 3')
# parse page
page_content = BeautifulSoup(page_response.content, "html.parser")
print('step 4')
# name of the page
naam = page_content.find_all(class_='<random class>')[0].decode_contents()
print('step 5')
# print it
print(naam)

Answer 1

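JSON seems like the right tool for this job. XML and SQL are a bit heavyweight for the simple functionality you need. Also, Python has built-in JSON reading and writing through the json module (a JSON object is very similar to a Python dict in many ways).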
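To illustrate how close JSON is to a Python dict, here is a minimal sketch of a round trip through the standard json module. The variable names and the two hostnames are only placeholders chosen for this example.

```python
import json

# A plain Python dict holding a list of sites (placeholder hostnames).
sites = {"sites": ["www.google.com", "www.example.com"]}

# Serialize the dict to JSON text ...
text = json.dumps(sites, indent=4)
print(text)

# ... and parse that text back into an equal Python dict.
restored = json.loads(text)
print(restored == sites)
```

json.dump and json.load do the same thing against an open file instead of a string.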

Just maintain the list of sites you want to visit in a JSON file similar to this one (put it in a file named test.json):

{
    "sites": ["www.google.com",
              "www.facebook.com",
              "www.example.com"]
}
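The entries in this file are bare hostnames, while requests.get raises a MissingSchema error for a URL without http:// or https://. One possible way to handle that, sketched below, is to normalize the entries right after loading them. The helper names with_scheme and load_sites are made up for this example, and defaulting to https is an assumption.

```python
import json


def with_scheme(url, default_scheme="https"):
    """Return url unchanged if it already starts with http(s)://, else prepend a scheme."""
    url = url.strip()
    if url.lower().startswith(("http://", "https://")):
        return url
    return f"{default_scheme}://{url}"


def load_sites(text):
    """Parse JSON text shaped like test.json and return a list of full URLs."""
    data = json.loads(text)
    return [with_scheme(site) for site in data["sites"]]


example = '{"sites": ["www.google.com", "https://www.example.com"]}'
print(load_sites(example))
# prints ['https://www.google.com', 'https://www.example.com']
```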

Then loop over the sites and scrape each one, like this:

import json
with open('test.json') as my_json:
    json_dict = json.load(my_json)
for website in json_dict["sites"]:
    print("About to scrape: ", website)

    # do scraping
    page_link = website
    ...

This outputs (if you remove the ...):

About to scrape:  www.google.com
About to scrape:  www.facebook.com
About to scrape:  www.example.com

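Just put the rest of the logic you want to use for scraping (like what you have above in your question) under the # do scraping comment. Note that requests.get needs a full URL including the scheme (for example https://www.example.com), so either store the entries that way or prepend the scheme before making the request.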
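Putting the two pieces together could look roughly like the sketch below. It is not the only way to do it: the function names first_match and scrape_sites, the 5-second timeout, and the "some-class" placeholder are assumptions for illustration, and it uses find instead of find_all(...)[0] so that a page without the class gives None rather than an IndexError.

```python
import json

import requests
from bs4 import BeautifulSoup


def first_match(html, class_name):
    """Return the inner HTML of the first element with class_name, or None."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(class_=class_name)
    return element.decode_contents() if element is not None else None


def scrape_sites(json_path, class_name, timeout=5):
    """Yield (url, result) for every entry under "sites" in the JSON file."""
    with open(json_path) as handle:
        sites = json.load(handle)["sites"]
    for site in sites:
        # requests needs a scheme; bare hostnames get https:// prepended.
        url = site if "://" in site else "https://" + site
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
        except requests.RequestException as error:
            # One unreachable site should not stop the whole run.
            print("Could not fetch", url, "-", error)
            yield url, None
            continue
        yield url, first_match(response.content, class_name)


# Example use (needs network access and a test.json file);
# "some-class" stands in for the <random class> from the question:
# for url, result in scrape_sites("test.json", "some-class"):
#     print(url, "->", result)
```

Keeping the parsing in its own function means it can be tried on a saved HTML string without making any requests.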

python

scraper

python-3.x