Scrape Embedded Google Sheet from HTML in Python

Question

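This one is a bit tricky for me. I'm trying to extract an embedded table from a Google Sheet in Python.

Here is the link

I don't own the sheet, but it is public.

Here is my code so far. When I go to output the headers, it shows me "". Any help would be greatly appreciated. The end goal is to convert this table into a pandas DataFrame. Thanks, everyone.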

import lxml.html as lh
import pandas as pd
import requests  # needed for requests.get() below

url = 'https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727'

page = requests.get(url)

doc = lh.fromstring(page.content)

tr_elements = doc.xpath('//tr')

col = []
i = 0

for t in tr_elements[0]:
    i +=1
    name = t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))

Answer 1

Well, if you want the data in a DataFrame, you can load it directly with pandas:

import pandas as pd

# header=1 uses the sheet's first data row as the column names
df = pd.read_html('https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727',
                  header=1)[0]
df.drop(columns='1', inplace=True)  # remove unnecessary index column called "1"

This gives you:

                               Target Ticker                   Acquirer  \
0       Acacia Communications Inc Com   ACIA      Cisco Systems Inc Com   
1  Advanced Disposal Services Inc Com   ADSW   Waste Management Inc Com   
2                    Allergan Plc Com    AGN             Abbvie Inc Com   
3           Ak Steel Holding Corp Com    AKS   Cleveland Cliffs Inc Com   
4      Td Ameritrade Holding Corp Com   AMTD  Schwab (Charles) Corp Com   

  Ticker.1 Current Price Take Over Price Price Diff % Diff Date Announced  \
0     CSCO           .79             .00        .21  1.76%       7/9/2019   
1       WM           .93             .15      $0.22  0.67%      4/15/2019   
2     ABBV          7.05            0.22        .17  1.61%      6/25/2019   
3      CLF           .98             .02      $0.04  1.34%      12/3/2019   
4     SCHW           .31             .27        .96  3.97%     11/25/2019   

  Deal Type  
0      Cash  
1      Cash  
2       C&S  
3     Stock  
4     Stock

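Note that read_html returns a list of DataFrames, one per table found in the page. In this case there is only one DataFrame, so we can refer to the first and only index position, [0].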
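To see the list behaviour and the effect of header=1 without hitting the network, here is a minimal sketch that runs read_html on a small inline table. The SAMPLE_HTML string is a hypothetical stand-in that only assumes the published sheet has a letter row (A, B, ...) on top and a row-number cell at the start of each row; the real markup may differ. It also assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical, simplified stand-in for the published sheet's HTML:
# a column-letter row on top and a row-number cell at the start of each row.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th></th><th>A</th><th>B</th></tr>
  </thead>
  <tbody>
    <tr><th>1</th><td>Target</td><td>Ticker</td></tr>
    <tr><th>2</th><td>Acacia Communications Inc Com</td><td>ACIA</td></tr>
    <tr><th>3</th><td>Allergan Plc Com</td><td>AGN</td></tr>
  </tbody>
</table>
"""

# read_html always returns a list, with one DataFrame per <table> it finds.
# header=1 skips the letter row and uses the next row as the column names.
tables = pd.read_html(StringIO(SAMPLE_HTML), header=1)

# Only one table here, so take the first (and only) element.
df = tables[0]

# The first column holds the sheet's row numbers; drop it by position.
df = df.drop(columns=df.columns[0])

print(df)
```

Dropping the first column by position instead of by the label "1" is just a choice made for this sketch, so that it does not depend on how the row-number header is parsed.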


python

google-sheets

scrape