I am trying to click on the expand button and then scrape the table
I am scraping tables from the site https://csr.gov.in/companyprofile.php?year=FY+2015-16&CIN=L00000CH1990PLC010573 but I am not getting exactly the results I am looking for. I want 11 columns from this link: "Company Name", "Class", "State", "Company Type", "RoC", "Sub Category", "Listing Status". That is 7 columns, and after them you can see an expand button, "CSR Details of FY 2017-18"; when you click on it you get 4 more columns: "Average Net Profit", "CSR Prescribed Expenditure", "CSR Spent", "Local Area Spent". I want all of these columns in a csv file. I wrote some code, but it is not working properly. I have attached a picture of the result for reference. Here is my code. Please help me get this data.
from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import csv

driver = webdriver.Chrome()
url_file = "csrdata.txt"
with open(url_file, "r") as url:
    url_pages = url.read()

# split the urls into a list to make them iterable
pages = url_pages.split("\n")  # split by lines using \n

data = []
# now we run a for loop to visit the urls one by one
for single_page in pages:
    driver.get(single_page)
    r = requests.get(single_page)
    soup = BeautifulSoup(r.content, 'html5lib')
    driver.find_element_by_link_text("CSR Details of FY 2017-18").click()
    table = driver.find_elements_by_xpath("//*[contains(@id,'colfy4')]")
    about = table[0].text
    x = about.split('\n')
    print(x)
    data.append(x)

df = pd.DataFrame(data)
print(df)

# write to csv
df.to_csv('csr.csv')
You don't need to use selenium, since all the information is already in the html code. You can also use the pandas built-in function pd.read_html() to convert an html table directly into a dataframe.
import requests
import pandas as pd
from bs4 import BeautifulSoup

data = []
for single_page in pages:  # pages: the list of urls read from csrdata.txt above
    r = requests.get(single_page)
    soup = BeautifulSoup(r.content, 'html5lib')
    table = soup.find_all('table')            # finds all tables on the page
    table_top = pd.read_html(str(table))[0]   # the top table
    try:                                      # try to get the other table, if it exists
        table_extra = pd.read_html(str(table))[7]
    except IndexError:
        table_extra = pd.DataFrame()
    result = pd.concat([table_top, table_extra])
    data.append(result)

pd.concat(data).to_csv('test.csv')
Output:

                            0                          1
0                       Class                     Public
1                       State                 Chandigarh
2                Company Type           Other than Govt.
3                         RoC             RoC-Chandigarh
4                Sub Category  Company limited by shares
5              Listing Status                     Listed
0          Average Net Profit                          0
1  CSR Prescribed Expenditure                          0
2                   CSR Spent                          0
3            Local Area Spent                          0
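
Since the question asks for one row per company (with the link included) rather than stacked key/value pairs, the two tables can also be pivoted into a single wide row before writing the csv. A minimal sketch of that idea, assuming the same pages list and the same table indices as above (the column name 'Link' and the file name 'csr_wide.csv' are just placeholders):

import requests
import pandas as pd

rows = []
for single_page in pages:
    tables = pd.read_html(requests.get(single_page).text)  # parse every <table> on the page
    top = tables[0]                                        # key/value pairs: Class, State, ...
    try:
        extra = tables[7]                                  # the CSR details table, if present
    except IndexError:
        extra = pd.DataFrame()
    combined = pd.concat([top, extra])
    # column 0 holds the labels, column 1 the values; transpose them into one wide row
    wide = combined.set_index(0).T
    wide.insert(0, 'Link', single_page)                    # keep the source url as its own column
    rows.append(wide)

pd.concat(rows, ignore_index=True).to_csv('csr_wide.csv', index=False)

Note that the company name does not appear in either of the two tables captured above, so if you need it you would still have to scrape it separately from the page (e.g. from a heading via BeautifulSoup); I have not verified where it sits on the live page.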