How to extract "alt" with text using Beautiful Soup

Question

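I just discovered Beautiful Soup, and it seems very powerful. I'm wondering whether there is an easy way to extract the "alt" field along with the text. A simple example would be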

from bs4 import BeautifulSoup

html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())

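This results in

Among the different sections of the orchestra you will find:

A in the strings

A in the brass

A in the woodwinds

But I would like the alt field to be included in the extracted text, which would give

Among the different sections of the orchestra you will find:

A violin in the strings

A trumpet in the brass

A clarinet and saxophone in the woodwinds

Thanks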

Answer 1

a = soup.findAll('img')

for every in a:
    print(every['alt'])

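That should do it.

The first line finds all <img> tags (using .findAll); the loop then prints the alt attribute of each one.

Or, for the text: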

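# a ResultSet has no .text of its own, so read .text from each tag in turn;
# <img> tags contain no text, so search for the <p> tags here
a = soup.findAll('p')
for eachline in a:
    print(eachline.text)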

Use a simple for loop to iterate over the results, or index them manually: soup.findAll('img')[0], then soup.findAll('img')[1], and so on.
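
One caveat with indexing a tag as tag['alt']: it raises a KeyError when an <img> has no alt attribute. A minimal sketch of two ways around that, assuming a made-up HTML snippet (not from the question) in which the second image lacks an alt:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration: the second <img> has no alt attribute
html = '<p><img src="a.jpg" alt="violin"/><img src="b.jpg"/></p>'
soup = BeautifulSoup(html, 'html.parser')

# .get() returns None instead of raising KeyError when the attribute is missing
alts = [img.get('alt') for img in soup.find_all('img')]

# alt=True restricts the search to <img> tags that actually carry an alt attribute
with_alt = [img['alt'] for img in soup.find_all('img', alt=True)]

print(alts)
print(with_alt)
```

Which of the two fits better depends on whether images without alt text should be skipped or reported.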

Answer 2

Consider this approach.

from bs4 import BeautifulSoup

html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
ptag = soup.find_all('p')   # get all tags of type <p>

for tag in ptag:
    instrument = tag.find('img')    # search for <img>
    if instrument:  # if we found an <img> tag...
        # ...create a new string with the content of 'alt' inserted into 'tag.text' after the leading "A "
        temp = tag.text[:2] + instrument['alt'] + tag.text[2:]
        print(temp) # print
    else:   # if we haven't found an <img> tag we just print 'tag.text'
        print(tag.text)

The output is

Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds

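The strategy is:

Find all <p> tags
Search for an <img> tag inside each of these <p> tags
If we find an <img> tag, insert the content of its alt attribute into tag.text and print the result
If we don't find an <img> tag, just print tag.text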
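
The slice tag.text[:2] works only because every paragraph in the example starts with "A ". As a variation on the same strategy (not the method used above), here is a rough sketch that replaces each <img> with its alt text in the parsed tree before extracting the text, so the position of the image no longer matters. The function name text_with_alt is made up for this illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""

def text_with_alt(html):
    """Return the text of each <p>, with every <img> replaced by its alt text."""
    soup = BeautifulSoup(html, 'html.parser')
    for img in soup.find_all('img'):
        # Swap the tag for its alt text; use an empty string if alt is missing
        img.replace_with(img.get('alt', ''))
    return [p.get_text() for p in soup.find_all('p')]

lines = text_with_alt(html_doc)
for line in lines:
    print(line)
```

Because the replacement happens in the tree itself, this should also cope with paragraphs that contain several images or that start with other words.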
