Unable to scrape data from Website with Python
I want to extract the tables under "Exchange Traded Bonds" and "OTC Trades" and save them to an Excel sheet.
I am trying to scrape the data with Python (BeautifulSoup and requests), but I cannot get anything back (I do not want to use Selenium). Can anyone guide me?
I do not get any error; the request simply never finishes in the Python terminal.
I think the terminal is hanging, because I do not even get an error message.
import requests
import pandas as pd
import os
from bs4 import BeautifulSoup as bs
url = "https://www1.nseindia.com/products/content/debt/corp_bonds/cbm_reporting_homepage.htm"
#condition True
#while condition:
html = requests.get(url).content
page = requests.get(url)
soup = bs(page.text, 'lxml')
df_list = pd.read_html(html)
df = df_list[0]  # can change 0 to another number
print(df)
If you look at the Network tab, you will see cbm_reporting_cbricsL.htm, which is what you actually need to scrape. By the way, you should also add headers for the request to work properly; see the detailed explanation in this thread:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Request the page that actually contains the table, with a browser-like
# User-Agent header so the server returns the data instead of hanging.
res = requests.get(
    'https://www1.nseindia.com/products/dynaContent/debt/corp_bonds/htms/cbm_reporting_cbricsL.htm',
    headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
)
soup = BeautifulSoup(res.text, 'lxml')

# Collect the <td> cells of every table row.
raw_columns = [row.find_all('td') for row in soup.find_all('tr')]

# The first 3 rows are dummy/header rows, so skip them.
df = pd.DataFrame.from_records(raw_columns[3:])
The result looks like this:
0 [INE001A07TA7] [HOUSING DEVELOPMENT FINANCE CORPORATION LTD S... [ 100.0030] [ 4.7082] [ 16] [[ 168000.00]] [ 100.0000] [ 4.7091]
1 [INE134E07AP6] [POWER FINANCE CORPORATION LTD. TRI SRV CATIII... [ 100.8500] [ 6.6934] [ 1] [ 1000.00 ] [ 100.8500] [ 6.6934]
2 [INE020B08963] [RURAL ELECTRIFICATION CORPORATION LIMITED SR-... [ 107.6835] [ 5.9200] [ 1] [ 1500.00 ] [ 107.6835] [ 5.9200]
3 [INE163N08131] [-] [ 104.2195] [ 6.6200] [ 1] [ 780.00 ] [ 104.2195] [ 6.6200]
4 [INE540P07343] [-] [ 104.3408] [ 9.3603] [ 6] [[ 1110.00]] [ 104.2640] [ 9.3800]
... ... ... ... ... ... ... ... ...
93 [INE377Y07250] [BAJAJ HOUSING FINANCE LIMITED SR 27 5.69 NCD ... [ 100.0300] [ 5.6845] [ 1] [ 9000.00 ] [ 100.0300] [ 5.6845]
94 [INE115A07ML7] [LIC HOUSING FINANCE LIMITED SRTR349OP-1 7.4NC... [ 105.0991] [ 5.5000] [ 1] [ 1000.00 ] [ 105.0991] [ 5.5000]
95 [INE020B07HN3] [RURAL ELECTRIFICATION CORPORATION LIMITED SR-... [ 123.6000] [ 4.4400] [ 1] [ 10.00 ] [ 123.6000] [ 4.4400]
96 [INE101A08070] [MAHINDRA AND MAHINDRA LIMITED 9.55 NCD 04JL63... [ 125.5000] [ 7.5218] [ 1] [ 820.00 ] [ 125.5000] [ 7.5218]
97 [INE062A08215] [STATE BANK OF INDIA SERIES I 8.75 BD PERPETUA... [ 104.5304] [ 7.0000] [ 1] [ 10.00 ] [ 104.5304] [ 7.0000]
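Note that the cells above are still BeautifulSoup Tag objects rather than plain values. Here is a minimal sketch of the same approach that extracts the cell text instead, so the DataFrame contains ordinary strings (it assumes the same URL and headers as above; column names are left unset because the page's exact header labels are not shown here):

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Same request as in the answer above; the User-Agent header is what keeps
# the server from stalling the request.
res = requests.get(
    'https://www1.nseindia.com/products/dynaContent/debt/corp_bonds/htms/cbm_reporting_cbricsL.htm',
    headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
)
soup = BeautifulSoup(res.text, 'lxml')

# Pull the text out of each <td> cell, skipping the first 3 dummy/header rows.
rows = []
for tr in soup.find_all('tr')[3:]:
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # ignore rows without data cells
        rows.append(cells)

df = pd.DataFrame(rows)
print(df.head())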
This is what I ended up with in the end.
import requests
import pandas as pd

headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

# Fetch the report page identified above and let pandas parse its tables.
html = requests.get(
    'https://www1.nseindia.com/products/dynaContent/debt/corp_bonds/htms/cbm_reporting_cbricsL.htm',
    headers=headers).content

df_list = pd.read_html(html)
df = df_list[0]  # the first table on the page
print(df)
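To get the table into an Excel sheet, as asked in the question, pandas can write the DataFrame directly. A minimal sketch, continuing from the df above; the file name and sheet name are arbitrary examples, and writing .xlsx requires the openpyxl package:

# Write the scraped table to an Excel file (file and sheet names are just examples).
df.to_excel('corp_bonds.xlsx', sheet_name='OTC Trades', index=False)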