Automating filename generation from URL text
I am parsing some content from the web and then saving it to a file. So far I have been creating the file name by hand.
Here is my code:
import requests
url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
html = requests.get(url).text.encode('utf-8')
with open("html_output_test.html", "wb") as file:
    file.write(html)
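How can I automate creating the file name from the URL, so that the HTML file is saved as The-Google-Way-Revolutionizing-Management (instead of html_output_test)?
This name comes from the original bookstore URL I posted, which was probably modified to avoid product advertising.
Thanks!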
You can use BeautifulSoup to get the title text from the page. I would also let requests handle the encoding by using .content:
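import requests
from bs4 import BeautifulSoup
url = "http://rads.whosebug.com/amzn/click/1593271840"
# .content returns the raw bytes, so no manual encoding step is needed
html = requests.get(url).content
# name the parser explicitly so the result does not depend on what is installed
title = BeautifulSoup(html, "html.parser").title.text
print(title)
with open("{}.html".format(title), "wb") as file:
    file.write(html)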
The Google Way: How One Company is Revolutionizing Management As We Know It: Bernard Girard: 9781593271848: Amazon.com: Books
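A title like the one above contains colons, which Windows rejects in file names, and other pages may have slashes in their titles. The following is only a minimal sketch of one way to clean a title before passing it to open(); safe_filename and its max_length parameter are hypothetical names, not part of requests or BeautifulSoup:

```python
import re

def safe_filename(title, max_length=100):
    """Hypothetical helper: make a page title usable as a file name."""
    # Replace characters that Windows or POSIX file systems reject.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title)
    # Collapse whitespace runs and trim leading/trailing spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Truncate, then trim again in case the cut left a trailing space or dot.
    cleaned = cleaned[:max_length].rstrip(" .")
    return cleaned or "untitled"

title = "The Google Way: How One Company is Revolutionizing Management As We Know It"
print(safe_filename(title) + ".html")
```

The cleaned string can then be used in place of the raw title in the "{}.html".format(...) call.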
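For that particular page, if you just want The Google Way: How One Company is Revolutionizing Management As We Know It, the product title is in the span with class a-size-large: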
text = BeautifulSoup(html, "html.parser").find("span", attrs={"class": "a-size-large"}).text
with open("{}.html".format(text), "wb") as file:
    file.write(html)
The URL containing The-Google-Way-Revolutionizing-Management is in the canonical link tag:
link = BeautifulSoup(html, "html.parser").find("link", attrs={"rel": "canonical"})
print(link["href"])
http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840
So to get that part, you need to split the URL:
print(link["href"].split("/")[3])
The-Google-Way-Revolutionizing-Management
link = BeautifulSoup(html, "html.parser").find("link", attrs={"rel": "canonical"})
with open("{}.html".format(link["href"].split("/")[3]), "wb") as file:
    file.write(html)
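Indexing split("/")[3] works for this URL but depends on the scheme and host always taking up the first three pieces. As a rough alternative, the path can be separated out with the standard library's urllib.parse first. This is only a sketch; filename_from_url is a hypothetical helper name, and it assumes the first path segment is the one you want:

```python
from urllib.parse import urlparse

def filename_from_url(url, extension=".html"):
    """Hypothetical helper: use the first non-empty path segment as the file name."""
    # urlparse isolates the path, so the scheme and host never get in the way.
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        raise ValueError("URL has no path segment to use as a name")
    return segments[0] + extension

url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
print(filename_from_url(url))
# The-Google-Way-Revolutionizing-Management.html
```

The same function can be applied to link["href"] from the canonical tag, or to the url attribute of the requests response, which holds the final URL after any redirects.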
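You could parse the web page with Beautiful Soup, get the page's title, then slugify it and use it as the file name, or generate a random file name with something like os.tmpfile (Python 2 only; in Python 3 the tempfile module covers this).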
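A minimal sketch of that idea using only the standard library might look like the following. Both slugify and output_name are hypothetical helpers written for illustration (third-party slugify packages exist and behave differently), and uuid is swapped in here as the source of the random fallback name:

```python
import re
import unicodedata
import uuid

def slugify(text):
    """Hypothetical helper: turn arbitrary text into a lowercase hyphenated slug."""
    # Fold accented characters to plain ASCII where possible.
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Replace every run of non-alphanumeric characters with a single hyphen.
    return re.sub(r"[^A-Za-z0-9]+", "-", ascii_text).strip("-").lower()

def output_name(title, extension=".html"):
    """Hypothetical helper: slug of the title, or a random name if the slug is empty."""
    slug = slugify(title or "")
    return (slug or uuid.uuid4().hex) + extension

print(output_name("The Google Way: How One Company is Revolutionizing Management"))
# the-google-way-how-one-company-is-revolutionizing-management.html
```

The title passed in would come from BeautifulSoup, for example the page's title text; the random fallback only kicks in when the title is missing or contains nothing usable.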