Problem parsing page from Wikipedia with BeautifulSoup
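I have a very simple test script that fetches an article from Wikipedia and extracts the first paragraph of text that appears on the page (i.e. the summary).
Here it is: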
from bs4 import BeautifulSoup
import urllib2
url = "https://en.wikipedia.org/wiki/Vicia_faba"
print url
source = urllib2.urlopen(url)
soup = BeautifulSoup(source, 'lxml')
print soup
summary = soup.find('p').getText()
print summary
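When I parse the summary I get nothing back, even though the page is fetched successfully and passed to BeautifulSoup correctly.
This looks like a very simple problem, but I can't get any further with it. BeautifulSoup is full of tricks, and unfortunately I don't know many of them!
Thanks in advance for any hints or suggestions.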
I changed a few things in your code:
Python 3.x:
from bs4 import BeautifulSoup
import urllib.request

url = "https://en.wikipedia.org/wiki/Vicia_faba"
print(url)
# Use a separate name for the response so it does not shadow the URL string.
with urllib.request.urlopen(url) as response:
    source = response.read()
soup = BeautifulSoup(source, 'lxml')
# print(soup)
# summary = soup.find('<p>').getText()
# print(summary)
# Print the text of every <p> tag on the page.
for para_tag in soup.find_all('p'):
    print(para_tag.text)
Output:
Faba sativa Moench.
Vicia faba, also known in the culinary sense as the broad bean, fava
bean, or faba bean is a species of flowering plant in the pea and bean
family Fabaceae. It is of uncertain origin[1]:160 and widely
cultivated as a crop for human consumption. It is also used as a cover
crop, the bell bean, which has smaller beans. Varieties with smaller,
harder seeds that are fed to horses or other animals are called field
bean, tic bean or tick bean. Horse bean, Vicia faba var. equina Pers.,
is a variety recognized as an accepted name.[2]
Some people suffer from favism, a hemolytic response to the
consumption of broad beans, a condition linked to G6PDD. Otherwise the
beans, with the outer seed coat removed, can be eaten raw or cooked.
In young plants, the outer seed coat can be eaten, and in very young
plants, the seed pod can be eaten.
Vicia faba is a stiffly erect plant 0.5 to 1.8 metres (1.6 to 5.9 ft)
tall, with stems that are square in cross-section. The leaves are 10
to 25 centimetres (3.9 to 9.8 in) long, pinnate with 2–7 leaflets, and
colored a distinct glaucous (Latin: glaucus) grey-green color. Unlike
most other vetches, the leaves do not have tendrils for climbing over
other vegetation.
The flowers are 1 to 2.5 centimetres (0.39 to 0.98 in) long with five
petals; the standard petals are white, the wing petals are white with
a black spot (true black, not deep purple or blue as is the case in
many "black" colorings)[3] and the keel petals are white.
Crimson-flowered broad beans also exist, which were recently saved
from extinction.[4] The flowers have a strong sweet scent which is
attractive to bees and other pollinators.[5]
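(output continues...)
Edit:
You need to look at how the article's markup is structured: grab the outer div first, then the tags inside it.
Something like: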
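container = soup.find("div", attrs={'class': 'mw-parser-output'})
paragraph = container.find("p")  # first <p> in the container (not used below)
for p in container.find_all("p"):
    # Keep only the two summary paragraphs, matched by their opening words.
    if 'Vicia faba, ' in p.text or 'Some people suffer ' in p.text:
        print(p.text)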
Output:
Vicia faba, also known in the culinary sense as the broad bean, fava
bean, or faba bean is a species of flowering plant in the pea and bean
family Fabaceae. It is of uncertain origin[1]:160 and widely
cultivated as a crop for human consumption. It is also used as a cover
crop, the bell bean, which has smaller beans. Varieties with smaller,
harder seeds that are fed to horses or other animals are called field
bean, tic bean or tick bean. Horse bean, Vicia faba var. equina Pers.,
is a variety recognized as an accepted name.[2]
Some people suffer from favism, a hemolytic response to the
consumption of broad beans, a condition linked to G6PDD. Otherwise the
beans, with the outer seed coat removed, can be eaten raw or cooked.
In young plants, the outer seed coat can be eaten, and in very young
plants, the seed pod can be eaten.
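Matching on the opening words only works for this one article. A more general variant might look like the sketch below. It is only a rough idea under some assumptions: that the summary paragraphs are direct children of the mw-parser-output div, and that any paragraph before them is empty or nested inside a table. SAMPLE_HTML is a made-up, simplified stand-in for the real page, and first_summary_paragraph is a hypothetical helper name. The real Wikipedia markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a Wikipedia article body (assumption).
SAMPLE_HTML = """
<html><body>
<div class="mw-parser-output">
  <p class="mw-empty-elt"></p>
  <table class="infobox"><tr><td><p>Faba sativa Moench.</p></td></tr></table>
  <p>Vicia faba, also known as the broad bean, is a species of flowering plant.</p>
  <p>Some people suffer from favism.</p>
</div>
</body></html>
"""


def first_summary_paragraph(html):
    """Return the text of the first non-empty <p> that is a direct child
    of the article container, or None if nothing suitable is found."""
    # html.parser ships with Python, so lxml is not needed for this sketch.
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"class": "mw-parser-output"})
    if container is None:
        return None
    # recursive=False looks only at direct children, which skips
    # paragraphs nested inside tables such as the infobox.
    for p in container.find_all("p", recursive=False):
        text = p.get_text().strip()
        if text:
            return text
    return None


print(first_summary_paragraph(SAMPLE_HTML))
```

To try it on the live page, pass in the source fetched with urllib.request above instead of SAMPLE_HTML. Whether the result is really the summary depends on the page layout matching the assumptions stated here.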