Scraping a table from a website with Python and getting the hyperlink behind the linked text

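I am learning Python and trying to scrape a table from https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-1-company.html. The table has 4 columns: "CIN", "Company Name", "Roc" and "Status". The "Company Name" is a hyperlink, and I need 5 columns: "CIN", "Company Name", "Company Link", "Roc" and "Status". I wrote the code below, but I only get 4 columns, and in place of "Company Link" I get a different result. I am sharing a screenshot of my output CSV file.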

Please help me scrape this table into the 5 columns "CIN", "Company Name", "Company Link", "Roc" and "Status". My code is below, followed by an image of my output CSV file.

import csv
import requests
from bs4 import BeautifulSoup
import re
import html5lib

def find_between(s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

loop = 1
while(True):
    try:
        URL = "https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-" + str(loop) + "-company.html"
        loop=loop+1
        r = requests.get(URL)
        soup = BeautifulSoup(r.content, 'html5lib')
        tbody = soup.find('tbody')
        rows = tbody.find_all('tr')
        row_list = list()
        for tr in rows:
            row=[]
            td = tr.find_all('td')
            for a in td:
                href=a.find('a',href=True)
                if href==None:
                    row.append(a.text.strip())
                    print(a.text.strip())
                else:
                    linktext = href.__getitem__
                    row.append(linktext)
            row_list.append(row)
        with open('zaubadata.csv', 'a') as csvFile:
            writer = csv.writer(csvFile)
            for r in row_list:
                writer.writerow(r)
    except Exception as obj:
        print(obj)
        csvFile.close()
        break




[![result of above code in 4 columns][1]][1]


  [1]: https://i.stack.imgur.com/oUVLK.png

This script iterates over all pages and writes the columns "CIN", "Company Name", "Company Link", "Roc" and "Status" to data.csv:

import csv
import requests
from bs4 import BeautifulSoup


url = 'https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-{}-company.html'

page = 1
all_data = []
while True:
    soup = BeautifulSoup(requests.get(url.format(page)).content, 'html.parser')

    rows = soup.select('#table tr:has(td)')

    if not rows:
        break

    for tr in rows:
        all_data.append([td.get_text(strip=True) for td in tr.select('td')])
        all_data[-1].insert(2, tr.a['href'])
        print(all_data[-1])

    page += 1

with open('data.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(["CIN", "Company Name", "Company Link", "Roc", "Status"])
    for row in all_data:
        csv_writer.writerow(row)
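
As for why the code in the question writes something unexpected in place of the link: `href.__getitem__` refers to the tag's lookup method without calling it, so the method object itself ends up in the row, and the company name is never appended for that cell. Here is a minimal sketch of the fix, using a made-up one-row table (the HTML, the CIN and the example.com URL are placeholders, not data from the site):

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table that only mimics the structure of the real page.
html = """
<table id="table">
  <tr><th>CIN</th><th>Company</th><th>RoC</th><th>Status</th></tr>
  <tr>
    <td>U00000DL2000PTC000000</td>
    <td><a href="https://example.com/company/EXAMPLE">EXAMPLE PRIVATE LIMITED</a></td>
    <td>Delhi</td>
    <td>Active</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

row = []
for td in soup.find_all("td"):
    row.append(td.get_text(strip=True))  # always keep the visible text of the cell
    link = td.find("a", href=True)
    if link is not None:
        row.append(link["href"])         # index the tag to read the attribute value

print(row)
```

With this change, a cell that contains a link contributes two values (its text and its `href`), so each row has 5 entries.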

Output data.csv from the full script (screenshot from LibreOffice):

I'll use pandas and show an example for a single page. You can do the same for the remaining pages.

import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get("https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-1-company.html")
soup = BeautifulSoup(res.text, "lxml")
table = soup.find("table", {"id":"table"})
tr = table.find_all("tr")
headers = [x.text.strip() for x in tr[0].find_all("th")]
headers.append("link")

rows = []
for row in tr[1:]:
    tds = row.find_all("td")
    temp = [td.text.strip() for td in tds]
    temp.append(tds[1].find("a")["href"])
    rows.append(temp)

df = pd.DataFrame(rows, columns = headers)
print(df)

# save df
df.to_csv("page-1.csv", index=False)

The resulting DataFrame:

                      CIN                                   Company    RoC  Status                                               link
0   U65992DL1988PTC030513         SHUBHAM CHIT FUND PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/SHUBHAM-CHIT...
1   U74999DL2016PTC305850                 AKS INDIA PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/AKS-INDIA-PR...
2   U74999DL2018NPL328316                    MYAKS INDIA FOUNDATION  Delhi  Active  https://www.zaubacorp.com/company/MYAKS-INDIA-...
3   U55204DL2001PTC109941      PARADIGM HOSPITALITY PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/PARADIGM-HOS...
4   U65992DL2000PTC105515             VNS CHIT FUND PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/VNS-CHIT-FUN...
5                AAL-1972                            RYSN INFRA LLP  Delhi  Active  https://www.zaubacorp.com/company/RYSN-INFRA-L...
6                AAL-8304                          REAL HARVEST LLP  Delhi  Active  https://www.zaubacorp.com/company/REAL-HARVEST...
7   U33309DL2017PTC318412             ARSHAD SPECTS PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/ARSHAD-SPECT...
8   U70109DL2010PTC208722          INSAAF BUILDWELL PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/INSAAF-BUILD...
9   U74899DL1991PTC046359          SYMPHONY TRAVELS PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/SYMPHONY-TRA...
10  U63010DL2009PTC194162       SYNAPSES ADVENTURES PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/SYNAPSES-ADV...
11  U65992DL1986PTC024128            VASU CHIT FUND PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/VASU-CHIT-FU...
12  U45309DL2017PTC322998     NAGARJUNA CONTRACTING PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/NAGARJUNA-CO...
13  U51109DL2008PTC176009              DINCO MOTORS PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/DINCO-MOTORS...
14  U45201DL2017PTC322910  NAGARJUNA INFRA PROJECTS PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/NAGARJUNA-IN...
15  U74300DL2005PLC143427         INDIA NEWS COMMUNICATIONS LIMITED  Delhi  Active  https://www.zaubacorp.com/company/INDIA-NEWS-C...
16  U74899DL1974PTC007374           GOLDEN TEXTILES PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/GOLDEN-TEXTI...
17  U29300DL2016PTC300009         GREENDAY INFOTECH PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/GREENDAY-INF...
18  U72900DL2019PTC344741               L2W SYSTEMS PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/L2W-SYSTEMS-...
19  U74899DL1987PTC027094              HI-TECH OILS PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/HI-TECH-OILS...
20               AAG-0149                        ALGO WIL INDIA LLP  Delhi  Active  https://www.zaubacorp.com/company/ALGO-WIL-IND...
21  U67120DL2000PTC107212   ANGEL BUSINESS SERVICES PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/ANGEL-BUSINE...
22  U51502DL2013PTC257933           STAR FLEX INDIA PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/STAR-FLEX-IN...
23  U63030DL2020PTC361756        LOG29 CARGO MOVERS PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/LOG29-CARGO-...
24  U72900DL2020PTC361739           ITONIC SOFTWARE PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/ITONIC-SOFTW...
25  U70109DL2020PTC361981      POLWELL REAL ESTATES PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/POLWELL-REAL...
26  U74999DL2016PTC306247       RAJBALA RBR REALCON PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/RAJBALA-RBR-...
27               AAI-3926                         JAIN PHARMACY LLP  Delhi  Active  https://www.zaubacorp.com/company/JAIN-PHARMAC...
28  U31906DL2020PTC360868               YASTRA TECH PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/YASTRA-TECH-...
29  U51101DL2014PTC268470      MRIDUL INTERNATIONAL PRIVATE LIMITED  Delhi  Active  https://www.zaubacorp.com/company/MRIDUL-INTER...
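
To do the same for the remaining pages, one option is to move the parsing into a function and call it in a loop. The sketch below rests on a few assumptions: `parse_page` and `scrape` are hypothetical helper names, the page range is arbitrary, the column names are taken from the output above, and `html.parser` is used instead of `lxml` only to avoid the extra dependency.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-{}-company.html"


def parse_page(html):
    """Return one list per data row: the cell texts followed by the company link."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "table"})
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        if not tds:  # the header row has only <th> cells
            continue
        link = tr.find("a", href=True)
        rows.append([td.get_text(strip=True) for td in tds]
                    + [link["href"] if link else ""])
    return rows


def scrape(first_page=1, last_page=3):
    """Fetch pages first_page..last_page and combine them into one DataFrame."""
    all_rows = []
    for page in range(first_page, last_page + 1):
        res = requests.get(URL.format(page), timeout=30)
        rows = parse_page(res.text)
        if not rows:  # stop at the first page without data rows
            break
        all_rows.extend(rows)
    return pd.DataFrame(all_rows, columns=["CIN", "Company", "RoC", "Status", "link"])
```

Calling `scrape(1, 3).to_csv("pages-1-3.csv", index=False)` would then save the first three pages to one file. Keeping the parsing separate from the download also makes it possible to check `parse_page` against a saved HTML file without hitting the site.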