Getting the date modified of files - web scraping with BeautifulSoup in Python
I am trying to download all the CSV files from this website: https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices. I have managed to do so with the following code:
from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Build absolute URLs from the relative hrefs in the CSV column.
csv_links = ['https://emi.ea.govt.nz' + a['href'] for a in soup.select('td.csv a')]

contents = []
for link in csv_links:
    req = requests.get(link)
    s = str(req.content, 'utf-8')
    contents.append(pd.read_csv(StringIO(s)))

final_price = pd.concat(contents)
If possible, I would like to streamline this. The files on the website are modified daily, and I don't want the script to re-download every file each time it runs; instead, I only want to fetch the files modified yesterday and append them to the files already in my folder. To achieve that, I need to scrape the date-modified column along with the file URLs. I would appreciate it if someone could show me how to get the date each file was updated.
You can use an nth-child range to filter for the first two columns of the table, along with an appropriate row offset for where the data rows start, matching the table by its class.
The returned list alternates column 1, column 2, column 1, and so on, so split it with slicing inside two list comprehensions that extract the URL or the date (as text). Complete the URLs, or convert the text to actual dates, in their respective comprehensions, then zip the resulting lists and convert to a DataFrame.
from datetime import datetime

from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

r = requests.get('https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices')
soup = bs(r.content, 'lxml')

# First two columns of every table row from the third row on;
# the result alternates link cell, date cell, link cell, ...
selected_columns = soup.select('.table tr:nth-child(n+3) td:nth-child(-n+2)')

df = pd.DataFrame(
    zip(
        ['https://emi.ea.govt.nz' + i.a['href'] for i in selected_columns[0::2]],
        [datetime.strptime(i.text, '%d %b %Y').date() for i in selected_columns[1::2]],
    ),
    columns=['name', 'date_modified'],
)
print(df)
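With the modified dates in the DataFrame, pulling only yesterday's files becomes a simple filter. Here is a minimal sketch of that last step, assuming the df built above and reusing the download-and-parse pattern from the question:

from datetime import date, timedelta
from io import StringIO

yesterday = date.today() - timedelta(days=1)

# Keep only the rows whose date_modified is yesterday.
new_files = df[df['date_modified'] == yesterday]

contents = []
for link in new_files['name']:
    resp = requests.get(link)
    contents.append(pd.read_csv(StringIO(resp.content.decode('utf-8'))))

if contents:  # pd.concat raises on an empty list
    new_prices = pd.concat(contents)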
You can apply the list comprehension technique:
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Absolute URLs from the CSV column of the table.
csv_links = ['https://emi.ea.govt.nz' + a['href']
             for a in soup.select('td[class="expand-column csv"] a')]
# Text of each date cell; the first match is skipped.
modified_dates = [d.text for d in soup.select('td[class="two"] a')[1:]]

df = pd.DataFrame(list(zip(csv_links, modified_dates)),
                  columns=['csv_links', 'modified_date'])
print(df)
Output:
csv_links modified_date
0 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 22 Mar 2022
1 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 22 Mar 2022
2 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 22 Mar 2022
3 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 22 Mar 2022
4 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 22 Mar 2022
.. ... ...
107 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 20 Dec 2021
108 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 20 Dec 2021
109 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 20 Dec 2021
110 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 20 Dec 2021
111 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 20 Dec 2021
[112 rows x 2 columns]
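Note that modified_date here is still plain text such as '22 Mar 2022'. If you want to compare it against yesterday's date as in the first answer, you can parse it first; a minimal sketch, assuming the format shown in the output above:

# Parse the scraped text into real dates for comparison/filtering.
df['modified_date'] = pd.to_datetime(df['modified_date'], format='%d %b %Y').dt.date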