Is there a way to download data with custom queries from a URL in Python?

Question

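I want to download data from the USDA site using custom queries. Instead of manually selecting the query options on the website, I am looking for a more convenient way to do this in Python. I tried using requests and http to access the URL and read the content, but it is not clear to me how I should pass the query, make the selection, and download the data as CSV. Does anyone know an easy way to do this in Python? Can we download data from a URL with a specific query? Any ideas?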

My current attempt

Here is the URL from which I want to select data with custom queries.

import io
import requests
import pandas as pd

url="https://www.marketnews.usda.gov/mnp/ls-report-retail?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))

So before reading the requested JSON into pandas, I need to pass the following query to get the correct data selection:

Category = "Retail"
Report Type = "Item"
Species = "Beef"
Region(s) = "National"
Start Dates = "2020-01-01"
End Date = "2021-02-08"

It is not clear to me how I should pass the query with the request and then download the filtered data as CSV. Is there an efficient way to do this in Python? Any ideas? Thanks.

Answer 1

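Some details

The simplest format to work with is text rather than HTML. The URL for the text download can be found on the HTML page.
The params argument of requests.get() takes a dict, so there is no need to build the complete URL string yourself.
The plain text is space-delimited; split on runs of two or more spaces.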

import io
import requests
import pandas as pd

url="https://www.marketnews.usda.gov/mnp/ls-report-retail"
p = {"repType":"summary","species":"BEEF","portal":"ls","category":"Retail","format":"text"}
r = requests.get(url, params=p)
# raw string so that "\s" is passed to the regex engine as-is
df = pd.read_csv(io.StringIO(r.text), sep=r"\s\s+", engine="python")

	Date	Region	Feature Rate	Outlets	Special Rate	Activity Index
0	02/05/2021	NATIONAL	69.40%	29,200	20.10%	81,650
1	02/05/2021	NORTHEAST	75.00%	5,500	3.80%	17,520
2	02/05/2021	SOUTHEAST	70.10%	7,400	28.00%	23,980
3	02/05/2021	MIDWEST	75.10%	6,100	19.90%	17,430
4	02/05/2021	SOUTH CENTRAL	57.90%	4,900	26.40%	9,720
5	02/05/2021	NORTHWEST	77.50%	1,300	2.50%	3,150
6	02/05/2021	SOUTHWEST	63.20%	3,800	27.50%	9,360
7	02/05/2021	ALASKA	87.00%	200	.00%	290
8	02/05/2021	HAWAII	46.70%	100	.00%	230
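
Splitting on two or more spaces is what keeps multi-word regions such as SOUTH CENTRAL in a single column. A minimal sketch using only the standard library; the sample line is made up for illustration (values copied from the table above, spacing assumed), not the actual report text:

```python
import re

# Hypothetical line in the style of the text report; real column spacing may differ.
sample_line = "02/05/2021  SOUTH CENTRAL  57.90%  4,900  26.40%  9,720"

# Splitting on any whitespace breaks "SOUTH CENTRAL" into two fields.
naive_fields = sample_line.split()

# Splitting on runs of two or more whitespace characters keeps it intact.
fields = re.split(r"\s\s+", sample_line)

print(len(naive_fields), naive_fields[1:3])
print(len(fields), fields[1])
```

This is the same pattern passed as sep to pd.read_csv above, which is why engine="python" is needed: the default C parser does not handle a regex separator like this one.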

Answer 2

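Just format the query data in the URL; it is effectively a REST API:

To add more query data, as @mullinscr said, you can change the values on the left-hand side of the page and press submit; the query names then appear in the URL (for example, the start date is called repDate).

If you hover over the Download as XML link, you will also see that you can specify the download format with format=<format_name>. Tabular data in XML may be easier to parse with pandas, so I also append format=xml at the end.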

category = "Retail"
report_type = "Item"
species = "BEEF"
regions = "NATIONAL"
start_date = "01-01-2019"
end_date = "01-01-2021"

# replace "-" with "%2F" (the URL-encoded form of "/"), matching the dates in the site's URLs
start_date = start_date.replace("-", "%2F")
end_date = end_date.replace("-", "%2F")

url = f"https://www.marketnews.usda.gov/mnp/ls-report-retail?runReport=true&portal=ls&startIndex=1&category={category}&repType={report_type}&species={species}&region={regions}&repDate={start_date}&endDate={end_date}&compareLy=No&format=xml"

# parse with pandas, etc...
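
As an alternative to the manual replace, the query string can be assembled with urllib.parse.urlencode from the standard library, which percent-encodes "/" as "%2F" automatically. A sketch, assuming the parameter names shown in the URL above and dates written with slashes:

```python
from urllib.parse import urlencode

base_url = "https://www.marketnews.usda.gov/mnp/ls-report-retail"

# Parameter names are taken from the URL above; the slash-separated date
# format is an assumption based on the "%2F" seen in the site's URLs.
params = {
    "runReport": "true",
    "portal": "ls",
    "startIndex": 1,
    "category": "Retail",
    "repType": "Item",
    "species": "BEEF",
    "region": "NATIONAL",
    "repDate": "01/01/2019",
    "endDate": "01/01/2021",
    "compareLy": "No",
    "format": "xml",
}

# urlencode escapes reserved characters, so "/" becomes "%2F".
url = f"{base_url}?{urlencode(params)}"
print(url)
```

The same dict can also be passed as requests.get(base_url, params=params), which should produce an equivalent encoding without building the string by hand.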
