Scrape Embedded Google Sheet from HTML in Python
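This one is a bit tricky for me. I'm trying to extract an embedded table from a Google Sheet in Python.
Here is the link.
I don't own the sheet, but it is public.
Here is my code so far. When I print the headers, it shows me "". Any help would be greatly appreciated. The end goal is to convert this table into a pandas DataFrame. Thanks, everyone.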
import lxml.html as lh
import pandas as pd
import requests  # needed for requests.get below

url = 'https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

# Treat each cell of the first table row as a column name
col = []
for i, t in enumerate(tr_elements[0], start=1):
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
Well, if you want to get the data into a DataFrame, you can load it directly:
df = pd.read_html('https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727',
                  header=1)[0]
df.drop(columns='1', inplace=True)  # remove unnecessary index column called "1"
This gives you:
                               Target  Ticker                   Acquirer  \
0       Acacia Communications Inc Com    ACIA      Cisco Systems Inc Com
1  Advanced Disposal Services Inc Com    ADSW   Waste Management Inc Com
2                    Allergan Plc Com     AGN             Abbvie Inc Com
3           Ak Steel Holding Corp Com     AKS   Cleveland Cliffs Inc Com
4      Td Ameritrade Holding Corp Com    AMTD  Schwab (Charles) Corp Com

   Ticker.1  Current Price  Take Over Price  Price Diff  % Diff  Date Announced  \
0      CSCO            .79              .00         .21   1.76%        7/9/2019
1        WM            .93              .15       $0.22   0.67%       4/15/2019
2      ABBV           7.05             0.22         .17   1.61%       6/25/2019
3       CLF            .98              .02       $0.04   1.34%       12/3/2019
4      SCHW            .31              .27         .96   3.97%      11/25/2019

   Deal Type
0       Cash
1       Cash
2        C&S
3      Stock
4      Stock
Note that read_html returns a list of DataFrames. In this case there is only one DataFrame, so we can refer to the first and only index position, [0].
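To see the list-returning behavior without touching the network, here is a minimal sketch that runs read_html on a small inline table. The HTML, the company names, and the blank first row are made up for illustration and only assume a layout similar to the published sheet (a row above the real header, plus a row-number column named "1"). It also assumes an HTML parser that pandas can use, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the published sheet: a blank first row, then the
# real header row, with the sheet's row numbers in the first column.
HTML = """
<table>
  <tr><td></td><td></td><td></td></tr>
  <tr><td>1</td><td>Target</td><td>Ticker</td></tr>
  <tr><td>2</td><td>Acme Corp</td><td>ACME</td></tr>
  <tr><td>3</td><td>Globex Inc</td><td>GLBX</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> it finds, always as a list.
tables = pd.read_html(StringIO(HTML), header=1)

# Only one table here, so take index 0 and drop the row-number column "1".
df = tables[0].drop(columns="1")

print(len(tables))
print(df)
```

If a page contains several tables, inspecting len(tables) and each entry's columns is a simple way to decide which index to keep.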