A CSV file at one URL (url 1) can be downloaded from the browser only if the main page URL (url) is also open in the browser. How to implement this in Python?
If the URL https://www.nseindia.com/companies-listing/corporate-filings-announcements is open in a browser tab, I can download the CSV file from another URL, https://www.nseindia.com/api/corporate-announcements?index=equities&from_date=14-01-2022&to_date=20-01-2022&csv=true, in another tab of the same browser. Otherwise the download fails with "Resource not found". How can I implement this in Python using pandas?
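This page uses cookies to check whether the file is requested after the first page has been opened.
You have to use requests with a Session to get the first page and its cookies, then use the same Session (which carries the cookies from the previous request) to get the CSV file, and finally use io to pass the data to pandas; io simulates a file in memory.
By the way: the server seems to send the file with a BOM (Byte Order Mark), so I read the bytes from r.content instead of the text from r.text, and pandas skips the BOM.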
import requests
import pandas as pd
import io
# --- create Session with User-Agent from real browser ---
headers = {
    'User-Agent': 'Mozilla/5.0'
}
s = requests.Session()
s.headers.update(headers)
# --- get first page to get cookies ---
url = 'https://www.nseindia.com/companies-listing/corporate-filings-announcements'
r = s.get(url)
# --- get file ---
url = 'https://www.nseindia.com/api/corporate-announcements?index=equities&from_date=14-01-2022&to_date=20-01-2022&csv=true'
r = s.get(url)
print(r.text[:100]) # the characters `ï»¿` at the beginning mean BOM
# so I will use `r.content` instead of `r.text`
# --- read file from memory ---
#df = pd.read_csv(io.StringIO(r.text)) # it doesn't remove BOM
df = pd.read_csv(io.BytesIO(r.content)) # it removes BOM
# --- show it ---
print(df.head())
Result:
"SYMBOL","COMPANY NAME","SUBJECT","DETAILS","BROADCAST DATE/TIME","RECEIPT","DISSEMINATION","DIFF
SYMBOL ... DIFFERENCE
0 TATAELXSI ... 00:00:08
1 RIIL ... 00:00:10
2 ERIS ... 00:00:06
3 RIIL ... 00:00:09
4 INGERRAND ... 00:00:09
[5 rows x 8 columns]
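To make the BOM remark concrete, here is a minimal offline sketch that uses only the standard library. The `raw` bytes are a made-up stand-in for `r.content` (hypothetical header and row, not real NSE data), and `first_header` is a helper name introduced only for this illustration. It shows why text decoded without BOM handling starts with stray characters, while the `utf-8-sig` codec strips them.

```python
import codecs
import csv
import io

# Hypothetical stand-in for r.content: a UTF-8 BOM followed by a tiny CSV.
raw = codecs.BOM_UTF8 + b'"SYMBOL","COMPANY NAME"\r\n"EXAMPLE","Example Co"\r\n'

# Three ways the same bytes can be turned into text.
as_latin1 = raw.decode('latin-1')      # BOM bytes become three visible characters
as_utf8 = raw.decode('utf-8')          # BOM becomes the invisible U+FEFF character
as_utf8_sig = raw.decode('utf-8-sig')  # BOM is removed by the codec


def first_header(text):
    """Return the first column name parsed from CSV text (illustrative helper)."""
    return next(csv.reader(io.StringIO(text)))[0]


# ascii() is used so the output is safe on any terminal encoding.
print(ascii(as_latin1[:3]))
print(ascii(as_utf8[:1]))
print(ascii(first_header(as_utf8)))
print(ascii(first_header(as_utf8_sig)))
```

Only the `utf-8-sig` variant yields a first column named exactly `SYMBOL`. Passing the raw bytes to pandas, as the answer does, avoids the problem in the same way; setting `r.encoding = 'utf-8-sig'` before reading `r.text` would presumably be another option, though that has not been checked against this server.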