How to scrape an updating HTML table using Selenium?
I want to scrape the coin table from the link below and build a CSV file from it over time. For each new coin update, a new entry should be added at the top of the existing data file.
Desired output:
Coin,Pings,...Datetime
BTC,25,...07:17:05 03/18/21
I haven't gotten very far; below is my attempt:
from selenium import webdriver
import numpy as np
import pandas as pd

firefox = webdriver.Firefox(executable_path="/usr/local/bin/geckodriver")
firefox.get('https://agile-cliffs-23967.herokuapp.com/binance/')

rows = len(firefox.find_elements_by_xpath("/html/body/div/section[2]/div/div/div/div/table/tr"))
columns = len(firefox.find_elements_by_xpath("/html/body/div/section[2]/div/div/div/div/table/tr[1]/th"))

df = pd.DataFrame(columns=['Coin','Pings','Net Vol BTC','Net Vol per','Recent Total Vol BTC', 'Recent Vol per', 'Recent Net Vol', 'Datetime'])

for r in range(1, rows+1):
    for c in range(1, columns+1):
        value = firefox.find_element_by_xpath("/html/body/div/section[2]/div/div/div/div/table/tr["+str(r)+"]/th["+str(c)+"]").text
        print(value)
        # df.loc[i, ['Coin']] =
You can append each row to the DataFrame by collecting its cells into a dictionary:
# reuse the headers when building the row dicts below
headers = ['Coin','Pings','Net Vol BTC','Net Vol per','Recent Total Vol BTC', 'Recent Vol per', 'Recent Net Vol', 'Datetime']
df = pd.DataFrame(columns=headers)

for r in range(1, rows+1):
    # read every cell in row r (this table uses <th> for its data cells)
    data = [firefox.find_element_by_xpath("/html/body/div/section[2]/div/div/div/div/table/tr["+str(r)+"]/th["+str(c)+"]").text
            for c in range(1, columns+1)]
    row_dict = dict(zip(headers, data))
    df = df.append(row_dict, ignore_index=True)  # DataFrame.append was removed in pandas 2.0; see the sketch below
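Note that two of the APIs above have since been removed: the `find_element_by_xpath`/`find_elements_by_xpath` helpers are gone in Selenium 4 (use `find_element(By.XPATH, ...)`) and `DataFrame.append` is gone in pandas 2.0 (use `pd.concat`). A minimal sketch of the same row-building loop on current versions, assuming the same `firefox` driver and table XPath as above:
from selenium.webdriver.common.by import By
import pandas as pd

headers = ['Coin', 'Pings', 'Net Vol BTC', 'Net Vol per', 'Recent Total Vol BTC',
           'Recent Vol per', 'Recent Net Vol', 'Datetime']
df = pd.DataFrame(columns=headers)

table_xpath = "/html/body/div/section[2]/div/div/div/div/table/tr"
rows = len(firefox.find_elements(By.XPATH, table_xpath))
columns = len(firefox.find_elements(By.XPATH, table_xpath + "[1]/th"))

for r in range(1, rows + 1):
    # read every cell in row r with the Selenium 4 locator API
    data = [firefox.find_element(By.XPATH, f"{table_xpath}[{r}]/th[{c}]").text
            for c in range(1, columns + 1)]
    # build the row as a single-row DataFrame and concatenate it on
    df = pd.concat([df, pd.DataFrame([dict(zip(headers, data))])], ignore_index=True)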
Since the data is loaded dynamically, you can retrieve it directly from the endpoint the page itself polls, so Selenium isn't needed at all. The endpoint returns JSON whose rows are |-delimited strings; splitting each string gives the values to append to the DataFrame. Since the site updates once a minute, you can wrap everything in a while True loop so the code runs every 60 seconds:
import requests
import time
import json
import pandas as pd

headers = ['Coin','Pings','Net Vol BTC','Net Vol %','Recent Total Vol BTC', 'Recent Vol %', 'Recent Net Vol', 'Datetime (UTC)']
df = pd.DataFrame(columns=headers)

s = requests.Session()

starttime = time.time()
while True:
    response = s.get('https://agile-cliffs-23967.herokuapp.com/ok', headers={'Connection': 'keep-alive'})
    d = json.loads(response.text)
    # each entry in 'resu' is one '|'-delimited row; the last element is not row data, so drop it
    rows = [str(i).split('|') for i in d['resu'][:-1]]
    if rows:
        data = [dict(zip(headers, l)) for l in rows]
        df = df.append(data, ignore_index=True)  # removed in pandas 2.0; use pd.concat on current versions
        df.to_csv('filename.csv', index=False)
    # sleep until the next 60-second boundary so drift from processing time doesn't accumulate
    time.sleep(60.0 - ((time.time() - starttime) % 60.0))
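The question asks for each update to land at the top of the existing file, whereas the loop above appends new rows at the bottom. A minimal variation, assuming the same `data` list of row dicts built inside the loop, is to concatenate the fresh rows in front of the accumulated frame before writing:
# prepend: freshest rows first, accumulated history after
df = pd.concat([pd.DataFrame(data), df], ignore_index=True)
df.to_csv('filename.csv', index=False)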