Scraping a table from a website using Python and getting the hyperlink along with the text content
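I am learning Python and I am trying to scrape the table from https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-1-company.html. The table has 4 columns: "CIN", "Company Name", "Roc", and "Status". The "Company Name" cells are hyperlinks, and I need 5 columns: "CIN", "Company Name", "Company Link", "Roc", and "Status". I wrote some code for this, but I only get 4 columns, and where the company link should be I get something different. I am sharing a screenshot of my output CSV file.
Please help me scrape this table into the 5 columns "CIN", "Company Name", "Company Link", "Roc", and "Status". My code is below, followed by an image of my output CSV file.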
import csv
import requests
from bs4 import BeautifulSoup
import re
import html5lib
def find_between(s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""
loop = 1
while(True):
    try:
        URL = "https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-" + str(loop) + "-company.html"
        loop=loop+1
        r = requests.get(URL)
        soup = BeautifulSoup(r.content, 'html5lib')
        tbody = soup.find('tbody')
        rows = tbody.find_all('tr')
        row_list = list()
        for tr in rows:
            row=[]
            td = tr.find_all('td')
            for a in td:
                href=a.find('a',href=True)
                if href==None:
                    row.append(a.text.strip())
                    print(a.text.strip())
                else:
                    linktext = href.__getitem__
                    row.append(linktext)
            row_list.append(row)
        with open('zaubadata.csv', 'a') as csvFile:
            writer = csv.writer(csvFile)
            for r in row_list:
                writer.writerow(r)
    except Exception as obj:
        print(obj)
        csvFile.close()
        break
[![result of above code in 4 columns][1]][1]
[1]: https://i.stack.imgur.com/oUVLK.png
This script goes through all pages and writes the columns "CIN", "Company Name", "Company Link", "Roc", and "Status" to data.csv:
import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-{}-company.html'
page = 1
all_data = []
while True:
    soup = BeautifulSoup(requests.get(url.format(page)).content, 'html.parser')
    # Select only rows that contain <td> cells (this skips the header row)
    rows = soup.select('#table tr:has(td)')
    # Stop when a page has no data rows
    if not rows:
        break
    for tr in rows:
        all_data.append([td.get_text(strip=True) for td in tr.select('td')])
        # Insert the company link right after the company name
        all_data[-1].insert(2, tr.a['href'])
        print(all_data[-1])
    page += 1

with open('data.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(["CIN", "Company Name", "Company Link", "Roc", "Status"])
    for row in all_data:
        csv_writer.writerow(row)
The output is written to data.csv (the screenshot in the original answer was taken in LibreOffice).
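For what it's worth, the code in the question most likely produces odd values in the link position because `href.__getitem__` is a reference to a bound method, not the attribute value; indexing the tag with `href['href']` returns the URL itself. Note also that the `else` branch there never appends the company name, which would explain ending up with 4 columns. Below is a minimal sketch of a per-row extractor along those lines. The HTML snippet, its values, and the function name `row_with_link` are made up for illustration and are not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like the table described in the question.
SAMPLE_HTML = """
<table id="table">
  <thead><tr><th>CIN</th><th>Company</th><th>RoC</th><th>Status</th></tr></thead>
  <tbody>
    <tr>
      <td>U00000XX0000PTC000000</td>
      <td><a href="https://example.com/company/EXAMPLE-CO">EXAMPLE CO</a></td>
      <td>Delhi</td>
      <td>Active</td>
    </tr>
  </tbody>
</table>
"""


def row_with_link(tr):
    """Return the cell texts of one <tr>; a cell's link is added right after its text."""
    row = []
    for td in tr.find_all("td"):
        # Always keep the visible text, including for the linked cell
        row.append(td.get_text(strip=True))
        a = td.find("a", href=True)
        if a is not None:
            # Index the tag to read the attribute value
            row.append(a["href"])
    return row


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [row_with_link(tr) for tr in soup.find("tbody").find_all("tr")]
print(rows)
```

With this shape, each row comes out as CIN, company name, company link, RoC, status, which matches the header written by the script above.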
I will use pandas and show an example for one page. You can do the same for the rest.
import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get("https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-1-company.html")
soup = BeautifulSoup(res.text, "lxml")
table = soup.find("table", {"id":"table"})
tr = table.find_all("tr")
# The first row holds the column headers; add one extra column for the link
headers = [x.text.strip() for x in tr[0].find_all("th")]
headers.append("link")
rows = []
for row in tr[1:]:
    tds = row.find_all("td")
    temp = [td.text.strip() for td in tds]
    # The company name (second cell) contains the hyperlink
    temp.append(tds[1].find("a")["href"])
    rows.append(temp)
df = pd.DataFrame(rows, columns = headers)
print(df)
# save df
df.to_csv("page-1.csv", index=False)
DataFrame:
CIN Company RoC Status link
0 U65992DL1988PTC030513 SHUBHAM CHIT FUND PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/SHUBHAM-CHIT...
1 U74999DL2016PTC305850 AKS INDIA PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/AKS-INDIA-PR...
2 U74999DL2018NPL328316 MYAKS INDIA FOUNDATION Delhi Active https://www.zaubacorp.com/company/MYAKS-INDIA-...
3 U55204DL2001PTC109941 PARADIGM HOSPITALITY PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/PARADIGM-HOS...
4 U65992DL2000PTC105515 VNS CHIT FUND PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/VNS-CHIT-FUN...
5 AAL-1972 RYSN INFRA LLP Delhi Active https://www.zaubacorp.com/company/RYSN-INFRA-L...
6 AAL-8304 REAL HARVEST LLP Delhi Active https://www.zaubacorp.com/company/REAL-HARVEST...
7 U33309DL2017PTC318412 ARSHAD SPECTS PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/ARSHAD-SPECT...
8 U70109DL2010PTC208722 INSAAF BUILDWELL PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/INSAAF-BUILD...
9 U74899DL1991PTC046359 SYMPHONY TRAVELS PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/SYMPHONY-TRA...
10 U63010DL2009PTC194162 SYNAPSES ADVENTURES PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/SYNAPSES-ADV...
11 U65992DL1986PTC024128 VASU CHIT FUND PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/VASU-CHIT-FU...
12 U45309DL2017PTC322998 NAGARJUNA CONTRACTING PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/NAGARJUNA-CO...
13 U51109DL2008PTC176009 DINCO MOTORS PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/DINCO-MOTORS...
14 U45201DL2017PTC322910 NAGARJUNA INFRA PROJECTS PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/NAGARJUNA-IN...
15 U74300DL2005PLC143427 INDIA NEWS COMMUNICATIONS LIMITED Delhi Active https://www.zaubacorp.com/company/INDIA-NEWS-C...
16 U74899DL1974PTC007374 GOLDEN TEXTILES PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/GOLDEN-TEXTI...
17 U29300DL2016PTC300009 GREENDAY INFOTECH PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/GREENDAY-INF...
18 U72900DL2019PTC344741 L2W SYSTEMS PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/L2W-SYSTEMS-...
19 U74899DL1987PTC027094 HI-TECH OILS PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/HI-TECH-OILS...
20 AAG-0149 ALGO WIL INDIA LLP Delhi Active https://www.zaubacorp.com/company/ALGO-WIL-IND...
21 U67120DL2000PTC107212 ANGEL BUSINESS SERVICES PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/ANGEL-BUSINE...
22 U51502DL2013PTC257933 STAR FLEX INDIA PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/STAR-FLEX-IN...
23 U63030DL2020PTC361756 LOG29 CARGO MOVERS PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/LOG29-CARGO-...
24 U72900DL2020PTC361739 ITONIC SOFTWARE PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/ITONIC-SOFTW...
25 U70109DL2020PTC361981 POLWELL REAL ESTATES PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/POLWELL-REAL...
26 U74999DL2016PTC306247 RAJBALA RBR REALCON PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/RAJBALA-RBR-...
27 AAI-3926 JAIN PHARMACY LLP Delhi Active https://www.zaubacorp.com/company/JAIN-PHARMAC...
28 U31906DL2020PTC360868 YASTRA TECH PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/YASTRA-TECH-...
29 U51101DL2014PTC268470 MRIDUL INTERNATIONAL PRIVATE LIMITED Delhi Active https://www.zaubacorp.com/company/MRIDUL-INTER...
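One possible way to "do the same for the rest" is to move the parsing into a function and concatenate the per-page frames. This is only a sketch under some assumptions: `parse_page`, `scrape_pages`, and the `max_pages` cap are names and choices introduced here, not part of the answer above, and it assumes every page uses the same `id="table"` layout as page 1. It uses the built-in `html.parser` instead of `lxml` just to avoid an extra dependency.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-{}-company.html"


def parse_page(html):
    """Parse one listing page into a DataFrame (empty if there is no table)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "table"})
    if table is None:
        return pd.DataFrame()
    tr = table.find_all("tr")
    if not tr:
        return pd.DataFrame()
    headers = [th.text.strip() for th in tr[0].find_all("th")] + ["link"]
    rows = []
    for row in tr[1:]:
        tds = row.find_all("td")
        if len(tds) < 2:
            continue  # skip rows without data cells
        link = tds[1].find("a")
        temp = [td.text.strip() for td in tds]
        temp.append(link["href"] if link is not None else "")
        rows.append(temp)
    return pd.DataFrame(rows, columns=headers)


def scrape_pages(max_pages=3):
    """Fetch pages 1..max_pages and stack them; stops early on an empty page."""
    frames = []
    for page in range(1, max_pages + 1):
        res = requests.get(URL.format(page), timeout=30)
        df = parse_page(res.text)
        if df.empty:
            break
        frames.append(df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Example usage (performs network requests, so it is left commented out):
# df = scrape_pages(max_pages=3)
# df.to_csv("all-pages.csv", index=False)
```

Keeping the parsing separate from the fetching also makes it easy to check `parse_page` against a saved copy of a page before running it over many pages.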