Scraping table data and inserting it into a list on a 5-minute schedule on a production server in Python

I am trying to extract data from the table on this site, which has 382 rows. This is the site: https://www.dsebd.org/latest_share_price_scroll_l.php

I am using BeautifulSoup for the scraping, and I want this program to run on a 5-minute schedule. I am trying to insert the values of the 382 rows into a JSON list, excluding the header and the first (numbering) column. This is my code:

import requests
from bs4 import BeautifulSoup


def convert_to_html5lib(URL, my_list):
    r = requests.get(URL)
    # Create a BeautifulSoup object
    soup = BeautifulSoup(r.content, 'html5lib')
    soup.prettify()

    # result = soup.find_all("div")[1].get_text()
    result = soup.find('table', {'class': 'table table-bordered background-white shares-table fixedHeader'}).get_text()
    # result = result.find('tbody')
    print(result)
    for item in result.split():
        my_list.append(item)
    print(my_list)

    # return


details_list = []
convert_to_html5lib("http://www.dsebd.org/latest_share_price_scroll_l.php", details_list)
counter = 0
while counter < len(details_list):
    if counter == 0:
        company_name = details_list[counter]
        counter += 1
    last_trading_price = details_list[counter]
    counter += 1
    last_change_price_in_value = details_list[counter]
    counter += 1
schedule.every(5).minutes.do(scrape_stock)

But I am not getting all the values of the table. I want all the data of the 382-row table as one list so that I can later save it to a database. But I am not getting any results, and the scheduler isn't working either. What am I doing wrong here?

First, take a look at my code for getting all the data in the table. Since the data here keeps updating, I think it is better to use Selenium.

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import pandas as pd
url = "https://www.dsebd.org/latest_share_price_scroll_l.php"

driver = webdriver.Firefox(executable_path="")  # insert your webdriver path here

driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table', {'class': 'table table-bordered background-white shares-table fixedHeader'})

df = pd.read_html(str(table))

print(df)

Output:

[     Unnamed: 0  Unnamed: 1  Unnamed: 2  ...  Unnamed: 8  Unnamed: 9  Unnamed: 10
0             1   1JANATAMF         6.7  ...         137       4.022       605104
1             2  1STPRIMFMF        21.5  ...         215       5.193       243258
2             3    AAMRANET        52.4  ...        1227      65.793      1264871
3             4   AAMRATECH        31.5  ...         675      37.861      1218353
4             5    ABB1STMF         5.9  ...          57       2.517       428672
..          ...         ...         ...  ...         ...         ...          ...
377         378  WMSHIPYARD        11.2  ...         835      14.942      1374409
378         379         YPL        11.3  ...         247       4.863       434777
379         380  ZAHEENSPIN         8.8  ...         174       2.984       342971
380         381    ZAHINTEX         7.7  ...         111       1.301       174786
381         382  ZEALBANGLA       120.0  ...         102       0.640         5271

[382 rows x 11 columns]]
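Since the end goal is saving these rows to a database, the DataFrame returned by `pd.read_html` can be converted into a list of dicts (one per row) with `DataFrame.to_dict('records')` before insertion. A minimal sketch with made-up sample rows (the column names here are assumptions for illustration, since the scraped table comes back with `Unnamed:` headers):

```python
import pandas as pd

# Sample rows shaped like the scraped table (names and values are placeholders)
df = pd.DataFrame(
    {
        "trading_code": ["1JANATAMF", "AAMRANET"],
        "ltp": [6.7, 52.4],
        "volume": [605104, 1264871],
    }
)

# One dict per row -- easy to serialize as JSON or pass to a DB driver/ORM
records = df.to_dict("records")
print(records[0])
```

Each dict in `records` maps a column name to that row's value, which is a convenient shape for both JSON output and parameterized `INSERT` statements.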

You can meet the requirement with BeautifulSoup alone.

A few things are wrong here:

  1. You are scraping only one blob of text instead of row by row.
  2. Use the schedule library the correct way. (Reference: https://www.geeksforgeeks.org/python-schedule-library/)

Here is your solution, with changes:

import schedule
import time
from bs4 import BeautifulSoup
import requests

def convert_to_html5lib(url,details_list):
    # Make a GET request to fetch the raw HTML content
    html_content = requests.get(url).text
    # Parse the html content
    soup = BeautifulSoup(html_content, "lxml")
    # extract table from webpage
    table = soup.find("table", { "class" : "table table-bordered background-white shares-table fixedHeader" })
    rows = table.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        if not cols:
            # the header row uses <th> cells, so skip it
            continue
        # drop the first (numbering) column
        cols = [x.text.strip() for x in cols[1:]]
        details_list.append(cols)
        print(cols)

details_list = []
counter = 0
url="http://www.dsebd.org/latest_share_price_scroll_l.php"
# schedule job for every 5 mins   
schedule.every(5).minutes.do(convert_to_html5lib,url,details_list)
# same as your logic
while counter < len(details_list):
    if counter == 0:
        company_name = details_list[counter]
        counter += 1
    last_trading_price = details_list[counter]
    counter += 1
    last_change_price_in_value = details_list[counter]
    counter += 1
# keep checking for due jobs; the job itself only fires every 5 minutes
while True:
    schedule.run_pending()
    time.sleep(5)
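One more note: `details_list` ends up as a list of column lists (one list per row, with the numbering column already dropped), so the positional counter from the question can be replaced by unpacking each row directly. A sketch with hypothetical placeholder rows (the real rows have more columns, in whatever order the site's table uses):

```python
# Each entry mimics one scraped row after stripping the numbering column
# (values here are hypothetical placeholders, not real site data)
details_list = [
    ["1JANATAMF", "6.7", "+0.1"],
    ["AAMRANET", "52.4", "-0.3"],
]

# Unpack each row into named fields instead of walking a counter
for company_name, last_trading_price, change in details_list:
    print(company_name, float(last_trading_price), change)
```

This keeps each row's fields together, so a row can be written to the database as a unit instead of reassembling it from a flat list.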