Problem parsing page from Wikipedia with BeautifulSoup

I have a very simple test script that fetches an article from Wikipedia and extracts the first paragraph of text on the page (i.e. the summary).

Here it is:

from bs4 import BeautifulSoup
import urllib2

url = "https://en.wikipedia.org/wiki/Vicia_faba" 
print url
source = urllib2.urlopen(url)
soup = BeautifulSoup(source, 'lxml')
print soup
summary = soup.find('p').getText()
print summary

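When I parse the summary I get nothing, even though the page is fetched successfully and passed correctly to BeautifulSoup.

This looks like a simple problem, but I can't get any further. BeautifulSoup is full of tricks, but unfortunately I don't know many of them!

Thanks in advance for any hints or suggestions.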

I changed a few things in your code:

Python 3.x:

from bs4 import BeautifulSoup
import urllib.request

url = "https://en.wikipedia.org/wiki/Vicia_faba"
print(url)

# Use a separate name for the response so the url string is not shadowed
with urllib.request.urlopen(url) as response:
    source = response.read()

soup = BeautifulSoup(source, 'lxml')

# soup.find('p') returns only the first <p>, which may be empty,
# so print the text of every paragraph instead
for para_tag in soup.find_all('p'):
    print(para_tag.text)

Output:

Faba sativa Moench.

Vicia faba, also known in the culinary sense as the broad bean, fava bean, or faba bean is a species of flowering plant in the pea and bean family Fabaceae. It is of uncertain origin[1]:160 and widely cultivated as a crop for human consumption. It is also used as a cover crop, the bell bean, which has smaller beans. Varieties with smaller, harder seeds that are fed to horses or other animals are called field bean, tic bean or tick bean. Horse bean, Vicia faba var. equina Pers., is a variety recognized as an accepted name.[2]

Some people suffer from favism, a hemolytic response to the consumption of broad beans, a condition linked to G6PDD. Otherwise the beans, with the outer seed coat removed, can be eaten raw or cooked. In young plants, the outer seed coat can be eaten, and in very young plants, the seed pod can be eaten.

Vicia faba is a stiffly erect plant 0.5 to 1.8 metres (1.6 to 5.9 ft) tall, with stems that are square in cross-section. The leaves are 10 to 25 centimetres (3.9 to 9.8 in) long, pinnate with 2–7 leaflets, and colored a distinct glaucous (Latin: glaucus) grey-green color. Unlike most other vetches, the leaves do not have tendrils for climbing over other vegetation.

The flowers are 1 to 2.5 centimetres (0.39 to 0.98 in) long with five petals; the standard petals are white, the wing petals are white with a black spot (true black, not deep purple or blue as is the case in many "black" colorings)[3] and the keel petals are white. Crimson-flowered broad beans also exist, which were recently saved from extinction.[4] The flowers have a strong sweet scent which is attractive to bees and other pollinators.[5]

And so on...

Edit:

You need to understand how the article is structured: grab the outer div first, then the tags inside it.

Something like:

# Limit the search to the div that holds the article content
container = soup.find("div", attrs={'class': 'mw-parser-output'})

# Print only the paragraphs that contain the expected opening phrases
for p in container.find_all("p"):
    if 'Vicia faba, ' in p.text or 'Some people suffer ' in p.text:
        print(p.text)

Output:

Vicia faba, also known in the culinary sense as the broad bean, fava bean, or faba bean is a species of flowering plant in the pea and bean family Fabaceae. It is of uncertain origin[1]:160 and widely cultivated as a crop for human consumption. It is also used as a cover crop, the bell bean, which has smaller beans. Varieties with smaller, harder seeds that are fed to horses or other animals are called field bean, tic bean or tick bean. Horse bean, Vicia faba var. equina Pers., is a variety recognized as an accepted name.[2]

Some people suffer from favism, a hemolytic response to the consumption of broad beans, a condition linked to G6PDD. Otherwise the beans, with the outer seed coat removed, can be eaten raw or cooked. In young plants, the outer seed coat can be eaten, and in very young plants, the seed pod can be eaten.
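
If you would rather not hard-code phrases from the article, one option is to take the first non-empty paragraph that sits directly inside the content div. The snippet below is only a sketch under a few assumptions: `SAMPLE_HTML` is a simplified, made-up stand-in for the real page (an empty leading `<p>` and an infobox table containing its own `<p>`), `first_paragraph` is a hypothetical helper name, and `html.parser` is used instead of `lxml` so it runs without extra dependencies. The real page layout may differ, so check it in your browser first.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the body of a Wikipedia article.
SAMPLE_HTML = """
<html><body>
<div class="mw-parser-output">
  <p class="mw-empty-elt"></p>
  <table class="infobox"><tr><td><p>Faba sativa Moench.</p></td></tr></table>
  <p><b>Vicia faba</b> is a species of flowering plant.</p>
  <p>Some people suffer from favism.</p>
</div>
</body></html>
"""


def first_paragraph(html):
    """Return the text of the first non-empty <p> directly inside the content div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"class": "mw-parser-output"})
    if container is None:
        return None
    # recursive=False looks only at direct children, which skips
    # paragraphs nested inside tables such as the infobox.
    for p in container.find_all("p", recursive=False):
        text = p.get_text().strip()
        if text:
            return text
    return None


print(first_paragraph(SAMPLE_HTML))
```

On this sample, a plain `soup.find('p').get_text()` returns an empty string because the first `<p>` has no content, which would match the "nothing" you saw. To try it on the live article, pass the `source` bytes read from `urllib.request.urlopen(url)` to `first_paragraph` instead of `SAMPLE_HTML`.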