BeautifulSoup: Extract "img alt" content Web Scraping in Python

Question

I'm working in Python 3. My objective is to extract the different values from a table and put them into separate lists.

The problem is that I can't get the value of the "img alt" inside the td.

Here is my code:

from bs4 import BeautifulSoup
import urllib.request

redditFile = urllib.request.urlopen("http://www.mtggoldfish.com/movers/online/all")
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
all_tables = soup.find_all('table')

right_table = soup.find('table', class_='table table-bordered table-striped table-condensed movers-table')

#create a list
A=[]
B=[]
C=[]
D=[]

for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    increment = row.findAll('span')
    colection = row.findAll('img')
    link = row.findAll('a')
    if len(cells) == 6:
        A.append(cells[0].find(text=True))
        B.append(increment[0].find(text=True))
        C.append(colection[0])
        D.append(link[0].find(text=True))
print(A)
print(B)
print(C)
print(D)

This code gives this result:

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
['+8.40', '+2.47', '+1.35', '+1.28', '+1.14', '+0.99', '+0.94', '+0.91', '+0.90', '+0.75']
[<img alt="ORI" class="sprite-set_symbols_ORI" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, <img alt="PRM" class="sprite-set_symbols_PRM" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, <img alt="8ED" class="sprite-set_symbols_8ED" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, <img alt="EX" class="sprite-set_symbols_EX" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, <img alt="TSB" class="sprite-set_symbols_TSB" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, <img alt="WL" class="sprite-set_symbols_WL" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, , , , ]
["Jace, Vryn's Prodigy", "Gaea's Cradle", 'Ensnaring Bridge', 'City of Traitors', 'Pendelhaven', 'Firestorm', 'Kor Spiritdancer', 'Scalding Tarn', 'Daybreak Coronet', 'Grove of the Burnwillows']

But I need the img alt value from the colection variable (for example, the first img alt value is "ORI").

I don't know what else to try. Can you guys help me solve this problem?

Thank you very much.

Answer 1

Once you have an <img> node instance, you can get its alt value like this:

alt_tag = img.attrs['alt']

Since you get a collection of img elements, you can iterate over it and retrieve each element's alt attribute:

tags = []
collection = soup.findAll("img")
for img in collection:
    if 'alt' in img.attrs:
        tags.append(img.attrs['alt'])
# Do whatever you need to do with your list of alt attributes.
print(tags)
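
As a minimal sketch of the same idea, the snippet below runs against a made-up HTML fragment rather than the live page (the markup and the variable names are assumptions for illustration only). It uses `img.get("alt")`, which returns `None` when the attribute is missing, so the explicit `'alt' in img.attrs` check is not needed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not taken from the real page.
html = '<p><img alt="ORI" src="a.gif"/><img src="b.gif"/><img alt="PRM" src="c.gif"/></p>'
soup = BeautifulSoup(html, "html.parser")

# Tag.get returns None instead of raising KeyError when alt is absent.
alts = [img.get("alt") for img in soup.find_all("img")]
print(alts)  # ['ORI', None, 'PRM']

# Keep only the images that actually carry an alt attribute.
tags = [alt for alt in alts if alt is not None]
print(tags)  # ['ORI', 'PRM']
```

Indexing with `img["alt"]` would raise `KeyError` on the second image here, which is why either the membership check or `.get()` is useful when some images may lack the attribute.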

Answer 2

If you only need the alt from the img tags, you can simply select the img tags inside the table and extract the alt attribute:

right_table = soup.find('table', class_='table table-bordered table-striped table-condensed movers-table')

print([img["alt"] for img in right_table.select("img[alt]")])
['ORI', 'PRM', '8ED', 'EX', 'TSB', 'WL', 'ROE', 'ZEN', 'FUT', 'FUT']
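
To see what the `img[alt]` selector does and does not match, here is a small sketch on invented markup (the table contents and the shortened `movers-table` class are assumptions, not the real page). Scoping the `select` call to the table skips images elsewhere on the page, and the `[alt]` part skips images with no alt attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one image outside the table, one inside without alt.
html = """
<img alt="logo" src="logo.gif"/>
<table class="movers-table">
  <tr><td><img alt="ORI" src="s.gif"/></td></tr>
  <tr><td><img src="no-alt.gif"/></td></tr>
  <tr><td><img alt="PRM" src="s.gif"/></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Restrict the search to the table, then match only <img> tags that have alt.
table = soup.select_one("table.movers-table")
alts = [img["alt"] for img in table.select("img[alt]")]
print(alts)  # ['ORI', 'PRM']
```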

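In your own loop you are using findAll when you only seem to want one element. If you only want the first element, use find, e.g. row.find('span'). Likewise, row.find('img')["alt"] will give you the alt value for each row. Looking at the page, there is only one img per tr, so you definitely don't need findAll.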
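
A rough sketch of the question's loop rewritten with `find`, run against a small invented table that mimics the structure described above (the HTML, the shortened class name, and the list names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table; the real page has more columns and rows.
html = """
<table class="movers-table">
  <tr><th>Rank</th><th>Change</th><th>Set</th><th>Card</th></tr>
  <tr><td>1</td><td><span class="increase">+8.40</span></td>
      <td><img alt="ORI" src="s.gif"/></td><td><a href="#">Jace, Vryn's Prodigy</a></td></tr>
  <tr><td>2</td><td><span class="increase">+2.47</span></td>
      <td><img alt="PRM" src="s.gif"/></td><td><a href="#">Gaea's Cradle</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
right_table = soup.find("table", class_="movers-table")

ranks, increments, sets, names = [], [], [], []
for row in right_table.find_all("tr"):
    img = row.find("img")
    if img is None:
        continue  # header row: no data cells to read
    ranks.append(row.find("td").get_text(strip=True))
    increments.append(row.find("span").get_text(strip=True))
    sets.append(img["alt"])  # the alt string itself, not the whole tag
    names.append(row.find("a").get_text(strip=True))

print(sets)  # ['ORI', 'PRM']
```

Skipping rows where `find("img")` returns `None` plays the same role as the `len(cells) == 6` check in the question: it filters out header rows before any indexing happens.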

If you want to recreate the table locally, I would put the data into a dictionary:

right_table = soup.find('table', class_='table table-bordered table-striped table-condensed movers-table')


table_dict = {}

for row in right_table.select("tr"):
    # spans with the "increase" class hold the increment values
    increments = [s.text for s in row.select('span.increase')]
    # make sure this tr actually contains data
    if increments:
        # rank/place is the first td's text; could also use find("td", {"class": "first-right"})
        place = int(row.td.text)
        # the card name is the text of the a tag
        title = row.find("a").text
        increments.append(title)
        # get the alt attribute from the img tag
        increments.append(row.find("img")["alt"])
        table_dict[place] = increments

from pprint import pprint as pp

pp(table_dict)

Output:

{1: [u'+8.78', u'68.03', u'+15.00%', u"Jace, Vryn's Prodigy", 'ORI'],
 2: [u'+2.47', u'47.96', u'+5.00%', u"Gaea's Cradle", 'PRM'],
 3: [u'+1.95', u'20.37', u'+11.00%', u'Firestorm', 'WL'],
 4: [u'+1.73', u'23.91', u'+8.00%', u'Force of Will', 'VMA'],
 5: [u'+1.35', u'40.88', u'+3.00%', u'Ensnaring Bridge', '8ED'],
 6: [u'+1.28', u'44.02', u'+3.00%', u'City of Traitors', 'EX'],
 7: [u'+1.15', u'41.98', u'+3.00%', u'Time Walk', 'VMA'],
 8: [u'+1.01', u'28.68', u'+4.00%', u'Daze', 'NE'],
 9: [u'+1.01', u'19.96', u'+5.00%', u"Goryo's Vengeance", 'BOK'],
 10: [u'+1.00', u'3.99', u'+33.00%', u'Unearth', 'UL']}

What you will see exactly matches the current table data. If you want all the winners, just change the url to http://www.mtggoldfish.com/movers-details/online/all/winners/dod

Or if you want to break out the fields and pull only the first increment:

for row in right_table.select("tr"):
    increment = row.find('span',{"class":'increase'})
    if increment:
        increment = increment.text
        place = int(row.td.text)
        title = row.select("a[data-full-image]")[0].text
        alt = (row.find("img")["alt"])
        table_dict[place] = {"title":title,"alt":alt, "inc":increment}


from pprint import pprint as pp

pp(table_dict)

Output:

{1: {'alt': 'ORI', 'inc': u'+8.78', 'title': u"Jace, Vryn's Prodigy"},
 2: {'alt': 'PRM', 'inc': u'+2.47', 'title': u"Gaea's Cradle"},
 3: {'alt': 'WL', 'inc': u'+1.95', 'title': u'Firestorm'},
 4: {'alt': 'VMA', 'inc': u'+1.73', 'title': u'Force of Will'},
 5: {'alt': '8ED', 'inc': u'+1.35', 'title': u'Ensnaring Bridge'},
 6: {'alt': 'EX', 'inc': u'+1.28', 'title': u'City of Traitors'},
 7: {'alt': 'VMA', 'inc': u'+1.15', 'title': u'Time Walk'},
 8: {'alt': 'NE', 'inc': u'+1.01', 'title': u'Daze'},
 9: {'alt': 'BOK', 'inc': u'+1.01', 'title': u"Goryo's Vengeance"},
 10: {'alt': 'UL', 'inc': u'+1.00', 'title': u'Unearth'}}

