I am trying to get table data from a website using Python urllib and Beautiful Soup, but it returns scripts
I tried BeautifulSoup, but it only scraped scripts from the URL.
from urllib import request
from bs4 import BeautifulSoup

url = 'https://ekartlogistics.com/shipmenttrack/FMPP0944216480'
# Fetch the raw HTML; the page's JavaScript is not executed here
read = request.urlopen(url)
soup = BeautifulSoup(read, 'html.parser')
print(soup.prettify())
It returns scripts along with the rest of the HTML.
I am trying to get the table data from the URL.
Maybe try loading the page headlessly with Selenium and then extracting the HTML? I couldn't get it to work with requests alone either.
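A quick way to check whether a page builds its table in the browser is to count the tags in the raw response. The sketch below uses only the standard library; `count_tags` and `sample_html` are hypothetical names, and the sample markup is a made-up stand-in, not the real response from the site.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each start tag appears in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Return a dict mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical stand-in for the raw response of a JavaScript-driven page:
# the scripts are there, but no table has been rendered yet.
sample_html = """
<html><body>
  <div id="app"></div>
  <script src="main.js"></script>
  <script>window.renderTable();</script>
</body></html>
"""

counts = count_tags(sample_html)
print(counts.get("script", 0), counts.get("table", 0))  # prints: 2 0
```

If the real response shows scripts but zero `table` tags, the table is most likely filled in after the page loads, which a plain HTTP fetch cannot see.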
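The data at this URL is loaded dynamically by JavaScript, so you can't scrape it with BeautifulSoup alone. You can use an automation tool such as Selenium. Here I use Selenium to run the page's JavaScript in a real browser, and pandas to extract the table data:
Code: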
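import time
from io import StringIO
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('chromedriver.exe'))
driver.maximize_window()
driver.get("https://ekartlogistics.com/shipmenttrack/FMPP0944216480")
time.sleep(3)  # give the page's JavaScript time to render the table
table = driver.find_element(By.CSS_SELECTOR, 'table.table').get_attribute('outerHTML')
# Wrap the HTML in StringIO; newer pandas deprecates passing a literal string
df = pd.read_html(StringIO(table))[0]
print(df)
driver.quit()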
Output:
                   Date         Time       Place                           Status
0     Sunday 17 October  04:24:26 PM     Kolkata                 Shipment Created
1     Sunday 17 October  04:24:31 PM     Kolkata     Dispatched to CentralHub_BAG
2     Sunday 17 October  04:56:00 PM     Kolkata       Received at CentralHub_BAG
3     Sunday 17 October  04:56:03 PM     Kolkata       Received at CentralHub_BAG
4     Monday 18 October  03:10:35 AM       Patna     Dispatched to CentralHub_BHT
5    Tuesday 19 October  04:48:44 AM       Patna       Received at CentralHub_BHT
6    Tuesday 19 October  05:03:44 PM  Samastipur  Dispatched to SatelliteHub_SAMA
7  Wednesday 20 October  02:47:44 AM  Samastipur    Received at SatelliteHub_SAMA
8   Thursday 21 October  09:21:52 AM  Samastipur                 Out For Delivery
9     Friday 22 October  07:38:36 AM  Samastipur                        Delivered
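The fixed `time.sleep(3)` in the code above either wastes time or is too short on a slow connection. Polling until the table actually appears is usually more robust. Selenium ships `WebDriverWait` and `expected_conditions` for this; the sketch below only illustrates the underlying idea with the standard library. `wait_until` and `table_ready` are hypothetical names, and the counter is a stand-in for a real DOM check.

```python
import time


def wait_until(predicate, timeout=10.0, poll_interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "is the table in the DOM yet?": truthy on the third check.
attempts = []


def table_ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(table_ready, timeout=5.0, poll_interval=0.01)
print(len(attempts))  # prints: 3
```

With a live driver, the predicate could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, 'table.table')`, which returns an empty (falsy) list until the table exists.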
Note: the following solution works on GOOGLE COLAB.
Credit: https://whosebug.com/users/12848411/fazlul
!pip install selenium
!apt-get update # refresh the package index so that apt install works
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver') # ChromeDriver Path
from io import StringIO
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage') # Everything above sets up Selenium on Colab
# chromedriver was copied to /usr/bin above, so it is found on PATH.
# Selenium 4 expects options= (the old chrome_options= keyword was removed).
wd = webdriver.Chrome(options=chrome_options)
wd.get("https://ekartlogistics.com/shipmenttrack/FMPP0944216480")
table = wd.find_element(By.CSS_SELECTOR, 'table.table').get_attribute('outerHTML')
# Wrap the HTML in StringIO; newer pandas deprecates passing a literal string
df = pd.read_html(StringIO(table))[0]
print(df)
Output:
                   Date         Time       Place                           Status
0     Sunday 17 October  04:24:26 PM     Kolkata                 Shipment Created
1     Sunday 17 October  04:24:31 PM     Kolkata     Dispatched to CentralHub_BAG
2     Sunday 17 October  04:56:00 PM     Kolkata       Received at CentralHub_BAG
3     Sunday 17 October  04:56:03 PM     Kolkata       Received at CentralHub_BAG
4     Monday 18 October  03:10:35 AM       Patna     Dispatched to CentralHub_BHT
5    Tuesday 19 October  04:48:44 AM       Patna       Received at CentralHub_BHT
6    Tuesday 19 October  05:03:44 PM  Samastipur  Dispatched to SatelliteHub_SAMA
7  Wednesday 20 October  02:47:44 AM  Samastipur    Received at SatelliteHub_SAMA
8   Thursday 21 October  09:21:52 AM  Samastipur                 Out For Delivery
9     Friday 22 October  07:38:36 AM  Samastipur                        Delivered
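For anyone curious about what happens to the `outerHTML` string, here is a rough standard-library sketch of turning table markup into rows. It is not how `pd.read_html` is implemented, only an illustration of the idea; `TableRows`, `parse_table`, and the hand-written `sample` fragment are hypothetical, with one row borrowed from the output above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text fragments of the current cell, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hand-written fragment shaped like the tracking table (an assumption).
sample = """
<table class="table">
  <tr><th>Date</th><th>Time</th><th>Place</th><th>Status</th></tr>
  <tr><td>Friday 22 October</td><td>07:38:36 AM</td><td>Samastipur</td><td>Delivered</td></tr>
</table>
"""

header, *body = parse_table(sample)
print(header)  # ['Date', 'Time', 'Place', 'Status']
print(body)    # [['Friday 22 October', '07:38:36 AM', 'Samastipur', 'Delivered']]
```

In practice `pd.read_html` is the more convenient choice, since it also handles headers, spans, and type conversion.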