Automating filename generation from URL text

I am parsing some content from the web and then saving it to a file. So far I have been creating the filename manually.

Here is my code:

import requests
url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
html = requests.get(url).text.encode('utf-8')
with open("html_output_test.html", "wb") as file:
    file.write(html)

How can I automate creating and saving the HTML file so that its name is taken from the URL, like this:

The-Google-Way-Revolutionizing-Management (instead of html_output_test)?

This name comes from the original bookstore URL I posted, which may have been modified to avoid product advertising.

Thanks!

You can use BeautifulSoup to get the title text from the page. I would also let requests handle the encoding by using .content:

import requests
from bs4 import BeautifulSoup

url = "http://rads.whosebug.com/amzn/click/1593271840"
html = requests.get(url).content  # raw bytes, so no manual encoding step
soup = BeautifulSoup(html, "html.parser")  # parse once and reuse

print(soup.title.text)
with open("{}.html".format(soup.title.text), "wb") as file:
    file.write(html)

The Google Way: How One Company is Revolutionizing Management As We Know It: Bernard Girard: 9781593271848: Amazon.com: Books

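For that particular page, if you only want "The Google Way: How One Company is Revolutionizing Management As We Know It", the product title is in a span with the class a-size-large: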

text = BeautifulSoup(html, "html.parser").find("span", attrs={"class": "a-size-large"}).text
with open("{}.html".format(text), "wb") as file:
    file.write(html)

The link containing The-Google-Way-Revolutionizing-Management is in the canonical link tag:

link = BeautifulSoup(html, "html.parser").find("link", attrs={"rel": "canonical"})
print(link["href"])

http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840

So to get that part, you need to split the href:

print(link["href"].split("/")[3])
The-Google-Way-Revolutionizing-Management

link = BeautifulSoup(html, "html.parser").find("link", attrs={"rel": "canonical"})
with open("{}.html".format(link["href"].split("/")[3]), "wb") as file:
    file.write(html)
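
If the slug is already part of the URL you request, as in the question, it can probably be extracted without parsing the page at all. A minimal sketch using only the standard library (Python 3); the helper name filename_from_url and the fallback name "page" are made up for illustration, and it assumes the slug is the first path segment:

```python
from urllib.parse import urlparse


def filename_from_url(url, default="page"):
    """Return '<first path segment>.html', or '<default>.html' if the path is empty."""
    # Split the path into its non-empty segments,
    # e.g. ['The-Google-Way-Revolutionizing-Management', 'dp', '1593271840'].
    segments = [part for part in urlparse(url).path.split("/") if part]
    name = segments[0] if segments else default
    return "{}.html".format(name)


url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
print(filename_from_url(url))
```

This only helps when the requested URL itself carries the name; for a redirecting or shortened URL, the canonical link approach above is the safer bet.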

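You can use Beautiful Soup to parse the page, get the page's title, then slugify it and use it as the filename, or generate a random filename with something like os.tmpfile (Python 2 only; the tempfile module is the Python 3 equivalent).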
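
One way to sketch the slugify-or-random-name idea with only the standard library (Python 3). The function names slugify and save_html are hypothetical, not from any library, and the rule used here (keep ASCII letters and digits, turn everything else into hyphens) is just one reasonable assumption about what a safe filename looks like:

```python
import re
import tempfile
import unicodedata


def slugify(text):
    """Reduce text to ASCII letters, digits and hyphens."""
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into a single hyphen.
    return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")


def save_html(html, title):
    """Write html (bytes) to '<slug>.html', or to a random temp file if the slug is empty."""
    slug = slugify(title)
    if slug:
        path = "{}.html".format(slug)
        with open(path, "wb") as f:
            f.write(html)
        return path
    # Fallback: tempfile picks a random name; delete=False keeps the file on disk.
    with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as f:
        f.write(html)
        return f.name


print(slugify("The Google Way: How One Company is Revolutionizing Management"))
```

Slugifying matters for the page title in particular, since it contains colons, which are not allowed in Windows filenames.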