BeautifulSoup: selecting one child from an array of output children

Question

I am trying to scrape a stock table from Yahoo. I want to print the table values for each row (this part works):

from bs4 import BeautifulSoup as bsoup
import urllib2
import re

url = "https://finance.yahoo.com/screener/predefined/undervalued_growth_stocks"

table_page = urllib2.urlopen(url)
soup = bsoup(table_page,'html.parser')

table = soup.find('table')

table_rows = table.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    tdrow = [i.text for i in td]
    print tdrow

This works fine and produces (for each row):

[u'AMAT', u'Applied Materials, Inc.', u'58.71', u'+1.09', u'+1.89%', u'7.364M', u'10.282M', u'62.614B', u'20.87', u'']
[u'PK', u'Park Hotels & Resorts Inc.', u'29.01', u'+0.34', u'+1.19%', u'628,369', u'1.216M', u'6.233B', u'2.49', u'']

What I want to do is select the first child/element of each row (the ticker symbol, "AMAT" in the output above) so that I can pass it on.

If I use

print tdrow[0]

inside the loop, it produces an error:

IndexError: list index out of range

If I remove the indentation from "print tdrow[0]", it works (I can specify [0] and get "PK", or [1] and get "Park Hotels & Resorts Inc."), but only for the last row. I want to use [0] for every row, inside the "for tr in table_rows" loop.

What am I missing?

Answer 1

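I think this happens because the first row is empty (I am not sure why), so print tdrow[0] raises an IndexError on it. The remaining rows are fine, so moving the print outside the loop refers only to the last row, which works.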
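The behaviour described above can be reproduced without any network access. The sketch below is illustrative only: `rows` is hypothetical stand-in data shaped like the `tdrow` lists in the question, with an empty first row. A plausible cause of that empty row, which the source does not confirm, is that the table's header row holds `th` cells, so `find_all('td')` finds nothing in it.

```python
# Hypothetical stand-in for the per-row cell lists built in the question.
rows = [
    [],  # first row: no <td> cells were found
    ['AMAT', 'Applied Materials, Inc.', '58.71'],
    ['PK', 'Park Hotels & Resorts Inc.', '29.01'],
]

failed_at = None
for index, tdrow in enumerate(rows):
    try:
        first = tdrow[0]
    except IndexError:
        # Indexing inside the loop fails on the empty row.
        # (Caught here only so the demonstration can continue.)
        failed_at = index
        continue

# After the loop, tdrow still refers to the last row only.
last_symbol = tdrow[0]
print(failed_at, last_symbol)
```

In the question's code the exception is not caught, so the loop stops at the first row and no ticker is ever printed.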

So checking that the row is not empty before indexing it should solve the problem.
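A minimal sketch of that check, assuming the rows have already been reduced to lists of cell text as in the question. The helper name `first_cells` and the sample data are hypothetical and not part of the original code.

```python
def first_cells(rows):
    """Return the first cell of every non-empty row."""
    symbols = []
    for tdrow in rows:
        if not tdrow:
            # Skip rows without any cells instead of indexing them.
            continue
        symbols.append(tdrow[0])
    return symbols


# Hypothetical sample data mirroring the output shown in the question.
rows = [
    [],
    ['AMAT', 'Applied Materials, Inc.', '58.71'],
    ['PK', 'Park Hotels & Resorts Inc.', '29.01'],
]

for symbol in first_cells(rows):
    print(symbol)
```

Inside the question's own loop, the same guard would be an `if tdrow:` line placed before the indented `print tdrow[0]`.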

This is the output I get when I run your code exactly as shown in the block above. Note that the first row is empty.

[]
[u'AMAT', u'Applied Materials, Inc.', u'58.74', u'+1.12', u'+1.94%', u'7.725M', u'10.282M', u'62.64B', u'20.88', u'']
[u'WBA', u'Walgreens Boots Alliance, Inc.', u'70.885', u'+0.105', u'+0.148%', u'4.947M', u'7.65M', u'71.562B', u'17.90', u'']

.
.
.

[u'PLD', u'Prologis, Inc.', u'67.29', u'+0.85', u'+1.28%', u'841,231', u'1.969M', u'36.275B', u'20.17', u'']
[u'PK', u'Park Hotels & Resorts Inc.', u'29.01', u'+0.34', u'+1.19%', u'640,075', u'1.216M', u'6.233B', u'2.49', u'']

