BeautifulSoup doesn't display the content
I want to scrape spot price data from the MCX India website.
The HTML visible when inspecting the element is as follows:
<div class="contents spotmarketprice">
<div id="cont-1" style="display: block;">
<table class="mcx-table mrB20" width="100%" cellspacing="8" id="tblSMP">
<thead>
<tr>
<th class="symbol-head">
Commodity
</th>
<th>
Unit
</th>
<th class="left1">
Location
</th>
<th class="right1">
Spot Price (Rs.)
</th>
<th>
Up/Down
</th>
</tr>
</thead>
<tbody>
<tr>
<td class="symbol" style="width:30%;">ALMOND</td>
<td style="width:17%;">1 KGS</td>
<td align="left" style="width:17%;">DELHI</td>
<td align="right" style="width:17%;">558.00</td>
<td align="right" class="padR20" style="width:19%;">=</td>
</tr>
The code I wrote is:
#import the required libraries
from bs4 import BeautifulSoup
import requests
#Getting data from website
source= requests.get('http://www.mcxindia.com/market-data/spot-market-price').text
#Getting the html code of the website
soup = BeautifulSoup(source, 'lxml')
#Navigating to the blocks where required content is present
division_1= soup.find('div', class_="contents spotmarketprice").div.table
#Displaying the results
print(division_1.tbody)
Output:
<tbody>
</tbody>
On the website, the content I want to get is available in ..., but nothing is displayed here. Please suggest a solution.
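It seems that the data in the table is loaded by JavaScript. That is why, if you fetch the page with the requests library, the response does not contain the table's data: requests does not execute JavaScript at all. So the problem here is not with BeautifulSoup.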
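One way to check this diagnosis is to count the table body rows in the raw HTML, before any JavaScript has run. The sketch below uses only the standard library's html.parser on two hypothetical inline snippets, one mimicking the empty tblSMP table that requests receives and one mimicking the rendered DOM. The count_body_rows helper and both sample strings are illustrative assumptions, not part of the site or of BeautifulSoup.

```python
from html.parser import HTMLParser


class BodyRowCounter(HTMLParser):
    """Count <tr> elements that sit inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "tr" and self.in_tbody:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_tbody = False


def count_body_rows(html):
    """Return the number of table body rows present in the given HTML text."""
    counter = BodyRowCounter()
    counter.feed(html)
    return counter.rows


# Hypothetical samples: the first mimics the raw response, the second the rendered DOM.
raw_html = (
    '<table id="tblSMP"><thead><tr><th>Commodity</th></tr></thead>'
    '<tbody>\n</tbody></table>'
)
rendered_html = '<table id="tblSMP"><tbody><tr><td>ALMOND</td></tr></tbody></table>'

print(count_body_rows(raw_html))       # 0: the rows are not in the static HTML
print(count_body_rows(rendered_html))  # 1: the row exists only after rendering
```

If the count is zero for the text returned by requests but the browser shows rows, the rows are being filled in client-side.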
To scrape JS-driven data, consider using selenium with chromedriver. A solution for this case looks like this:
# import libraries
from bs4 import BeautifulSoup
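from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# create a webdriver (raw string, so '\t' in the path is not read as a tab)
chromedriver_path = r'C:\path\to\chromedriver.exe'
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(chromedriver_path))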
# go to the page and get its source
driver.get('http://www.mcxindia.com/market-data/spot-market-price')
soup = BeautifulSoup(driver.page_source, 'html.parser')
# fetch mentioned data
table = soup.find('table', {'id': 'tblSMP'})
for tr in table.tbody.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    print(row)
# close the webdriver
driver.quit()
The output of the script above is:
['ALMOND', '1 KGS', 'DELHI', '558.00', '=']
['ALUMINIUM', '1 KGS', 'THANE', '137.60', '=']
['CARDAMOM', '1 KGS', 'VANDANMEDU', '2,525.00', '=']
['CASTORSEED', '100 KGS', 'DEESA', '3,626.00', '▼']
['CHANA', '100 KGS', 'DELHI', '4,163.00', '▲']
['COPPER', '1 KGS', 'THANE', '388.30', '=']
['COTTON', '1 BALES', 'RAJKOT', '15,790.00', '▲']
['CPO', '10 KGS', 'KANDLA', '630.10', '▼']
['CRUDEOIL', '1 BBL', 'MUMBAI', '2,418.00', '▲']
['GOLD', '10 GRMS', 'AHMEDABAD', '40,989.00', '=']
['GOLDGUINEA', '8 GRMS', 'AHMEDABAD', '32,923.00', '=']
['GOLDM', '10 GRMS', 'AHMEDABAD', '40,989.00', '=']
['GOLDPETAL', '1 GRMS', 'MUMBAI', '4,129.00', '=']
['GUARGUM', '100 KGS', 'JODHPUR', '5,880.00', '=']
['GUARSEED', '100 KGS', 'JODHPUR', '3,660.00', '=']
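The rows printed above are lists of strings, and the prices carry thousands separators. If numeric prices are needed, a small conversion step could look like the following sketch. The parse_row helper and its dictionary keys are hypothetical names chosen for illustration, and the sample rows are copied from the output above.

```python
def parse_row(row):
    """Turn one scraped row of strings into a dict with a numeric price."""
    commodity, unit, location, price, direction = row
    return {
        "commodity": commodity,
        "unit": unit,
        "location": location,
        # Prices are printed with thousands separators, e.g. '2,525.00'.
        "price": float(price.replace(",", "")),
        "direction": direction,
    }


rows = [
    ['ALMOND', '1 KGS', 'DELHI', '558.00', '='],
    ['CARDAMOM', '1 KGS', 'VANDANMEDU', '2,525.00', '='],
]
records = [parse_row(r) for r in rows]
print(records[1]["price"])  # 2525.0
```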
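UPD: I should point out that the code above answers the question for this particular table as it is rendered in the browser. However, websites sometimes store their data in 'application/json' or similar tags, which can be read with the 'requests' library (since they don't require JS).
αԋɱҽԃ αмєяιcαη found that the current website contains such a tag. Please check his answer. In this case, using requests is indeed better than selenium.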
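To illustrate the idea offline, here is a minimal sketch that pulls a JSON array out of plain page text with a regular expression and json.loads. The sample_html string, its payload variable, and the extract_data helper are assumptions made up for the example; the real page may embed more fields or use a different wrapper.

```python
import json
import re


def extract_data(html):
    """Return the list found after '"Data":' in the page text, or [] if absent."""
    match = re.search(r'"Data":(\[.*?\])', html)
    if match is None:
        return []
    return json.loads(match.group(1))


# Hypothetical page fragment with the data embedded in a script tag.
sample_html = (
    '<script>var payload = {"d":{"Data":['
    '{"EnSymbol":"ALMOND","Unit":"1 KGS","Location":"DELHI","TodaysSpotPrice":558.0},'
    '{"EnSymbol":"ZINC","Unit":"1 KGS","Location":"THANE","TodaysSpotPrice":155.15}'
    ']}};</script>'
)

print([item["EnSymbol"] for item in extract_data(sample_html)])  # ['ALMOND', 'ZINC']
```

The non-greedy pattern stops at the first closing bracket, so it only works while the array contains no nested lists. Returning an empty list when nothing matches avoids an AttributeError on pages that lack the payload.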
import requests
import re
import json
import pandas as pd
goal = ['EnSymbol', 'Unit', 'Location', 'TodaysSpotPrice']
def main(url):
    r = requests.get(url)
    match = json.loads(re.search(r'"Data":(\[.*?\])', r.text).group(1))
    allin = []
    for item in match:
        allin.append([item[x] for x in goal])
    df = pd.DataFrame(allin, columns=goal)
    print(df)
main("https://www.mcxindia.com/market-data/spot-market-price")
Output:
EnSymbol Unit Location TodaysSpotPrice
0 ALMOND 1 KGS DELHI 558.00
1 ALUMINIUM 1 KGS THANE 137.60
2 CARDAMOM 1 KGS VANDANMEDU 2525.00
3 CASTORSEED 100 KGS DEESA 3626.00
4 CHANA 100 KGS DELHI 4163.00
5 COPPER 1 KGS THANE 388.30
6 COTTON 1 BALES RAJKOT 15880.00
7 CPO 10 KGS KANDLA 635.90
8 CRUDEOIL 1 BBL MUMBAI 2418.00
9 GOLD 10 GRMS AHMEDABAD 40989.00
10 GOLDGUINEA 8 GRMS AHMEDABAD 32923.00
11 GOLDM 10 GRMS AHMEDABAD 40989.00
12 GOLDPETAL 1 GRMS MUMBAI 4129.00
13 GUARGUM 100 KGS JODHPUR 5880.00
14 GUARSEED 100 KGS JODHPUR 3660.00
15 KAPAS 20 KGS RAJKOT 927.50
16 LEAD 1 KGS CHENNAI 141.60
17 MENTHAOIL 1 KGS CHANDAUSI 1295.10
18 NATURALGAS 1 mmBtu HAZIRA 138.50
19 NICKEL 1 KGS THANE 892.00
20 PEPPER 100 KGS KOCHI 32700.00
21 RAW JUTE 100 KGS KOLKATA 4999.00
22 RBD PALMOLEIN 10 KGS KANDLA 700.40
23 REFSOYOIL 10 KGS INDORE 845.25
24 SILVER 1 KGS AHMEDABAD 36871.00
25 SILVERM 1 KGS AHMEDABAD 36871.00
26 SILVERMIC 1 KGS AHMEDABAD 36871.00
27 SUGARMDEL 100 KGS DELHI 3380.00
28 SUGARMKOL 100 KGS KOLHAPUR 3334.00
29 SUGARSKLP 100 KGS KOLHAPUR 3275.00
30 TIN 1 KGS MUMBAI 1160.50
31 WHEAT 100 KGS DELHI 1977.50
32 ZINC 1 KGS THANE 155.15
If you also want the change symbol, here is a version that includes it:
import requests
import re
import json
import pandas as pd
goal = ['EnSymbol', 'Unit', 'Location', 'TodaysSpotPrice', 'Change']
def main(url):
    r = requests.get(url)
    match = json.loads(re.search(r'"Data":(\[.*?\])', r.text).group(1))
    allin = []
    for item in match:
        item = [item[x] for x in goal]
        item[-1] = '▲' if item[-1] > 0 else '▼' if item[-1] < 0 else "="
        allin.append(item)
    df = pd.DataFrame(allin, columns=goal)
    print(df)
main("https://www.mcxindia.com/market-data/spot-market-price")
Output:
EnSymbol Unit Location TodaysSpotPrice Change
0 ALMOND 1 KGS DELHI 558.00 =
1 ALUMINIUM 1 KGS THANE 137.60 =
2 CARDAMOM 1 KGS VANDANMEDU 2525.00 =
3 CASTORSEED 100 KGS DEESA 3626.00 =
4 CHANA 100 KGS DELHI 4163.00 =
5 COPPER 1 KGS THANE 388.30 =
6 COTTON 1 BALES RAJKOT 15880.00 ▲
7 CPO 10 KGS KANDLA 635.90 ▲
8 CRUDEOIL 1 BBL MUMBAI 2418.00 ▲
9 GOLD 10 GRMS AHMEDABAD 40989.00 =
10 GOLDGUINEA 8 GRMS AHMEDABAD 32923.00 =
11 GOLDM 10 GRMS AHMEDABAD 40989.00 =
12 GOLDPETAL 1 GRMS MUMBAI 4129.00 =
13 GUARGUM 100 KGS JODHPUR 5880.00 =
14 GUARSEED 100 KGS JODHPUR 3660.00 =
15 KAPAS 20 KGS RAJKOT 927.50 ▲
16 LEAD 1 KGS CHENNAI 141.60 =
17 MENTHAOIL 1 KGS CHANDAUSI 1295.10 =
18 NATURALGAS 1 mmBtu HAZIRA 138.50 ▲
19 NICKEL 1 KGS THANE 892.00 =
20 PEPPER 100 KGS KOCHI 32600.00 ▼
21 RAW JUTE 100 KGS KOLKATA 4999.00 =
22 RBD PALMOLEIN 10 KGS KANDLA 700.40 ▼
23 REFSOYOIL 10 KGS INDORE 845.25 =
24 SILVER 1 KGS AHMEDABAD 36871.00 =
25 SILVERM 1 KGS AHMEDABAD 36871.00 =
26 SILVERMIC 1 KGS AHMEDABAD 36871.00 =
27 SUGARMDEL 100 KGS DELHI 3380.00 ▼
28 SUGARMKOL 100 KGS KOLHAPUR 3334.00 ▲
29 SUGARSKLP 100 KGS KOLHAPUR 3275.00 ▼
30 TIN 1 KGS MUMBAI 1160.50 ▼
31 WHEAT 100 KGS DELHI 1977.50 ▲
32 ZINC 1 KGS THANE 155.15 =
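The chained conditional expression that produces the symbol can be hard to read at a glance. A possible way to spell it out is a small helper, sketched below. The change_symbol name is hypothetical, and the sketch assumes, as the code above does, that the 'Change' field arrives as a number.

```python
def change_symbol(change):
    """Map a numeric change to an arrow: up, down, or unchanged."""
    if change > 0:
        return '▲'
    if change < 0:
        return '▼'
    return '='


print([change_symbol(x) for x in (90.0, -100.0, 0.0)])  # ['▲', '▼', '=']
```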