How can I show the title, header, and paragraph in the same order as the original web page? - Python
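I am parsing a Wikipedia page.
I want to search for a keyword, for example "The first abstraction", and display the title, the header, and the matching paragraph. How can I do that?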
Web :https://en.wikipedia.org/wiki/Mathematics
Search = "The first abstraction"
Output :
title: Mathematics
header: History
paragraph : The history of mathematics can be seen as an ever-increasing series of
abstractions. **The first abstraction**, which is shared by many animals,[14] was
probably that of numbers: the realization that a collection of two apples and a
collection of two oranges (for example) have something in common, namely quantity
of their members.
import bs4
import requests

response = requests.get("https://en.wikipedia.org/wiki/Mathematics")
if response is not None:
    html = bs4.BeautifulSoup(response.text, 'html.parser')
    title = html.select("#firstHeading")[0].text
    print(title)
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)
    # just grab the text up to contents as stated in question
    intro = '\n'.join([para.text for para in paragraphs[0:5]])
    print(intro)
This code shows the title fine, but the headers and paragraphs are not kept in order, so I can't match them up.
Thanks.
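First, as you loop over the <p> tags, you need to search for "The first abstraction", because you only want the paragraph that contains it.
So call find() on the text of each 'para' to check whether the expected text is present: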
paragraphs = html.select("p")
Search = "The first abstraction"  # expected text
for para in paragraphs:
    px = para.text
    # str.find() returns -1 when the text is not present
    if px.find(Search) > -1:
        print(para.text)
This gives you the expected paragraph:
The history of mathematics can be seen as an ever-increasing series of abstractions. The first abstraction, which is shared by many animals,[14] was probably that of numbers: the realization that a collection of two apples and a collection of two oranges (for example) have something in common, namely quantity of their members.
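The search step above can be tried without hitting the network. The following is only a minimal sketch: `SAMPLE_HTML` is a made-up stand-in for the downloaded page, and `find_paragraphs` is a hypothetical helper name, not something from the original code. It uses the `in` operator, which does the same check as `find(...) > -1`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<p>Mathematics is an area of knowledge.</p>
<h2>History</h2>
<p>The first abstraction was probably that of numbers.</p>
"""


def find_paragraphs(html_text, search):
    """Return the text of every <p> whose text contains `search`."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [p.get_text() for p in soup.find_all("p") if search in p.get_text()]


matches = find_paragraphs(SAMPLE_HTML, "The first abstraction")
print(matches)
```

Only the second paragraph of the sample contains the search text, so that is the only one returned.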
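Now the paragraph and the title are done. You still need to extract the header.
Look at the structure of the HTML you are parsing (this always helps).
In the page's HTML, the h2 is a sibling of the p tag that holds your text. Read up on sibling navigation in the Beautiful Soup documentation.
To move back to the previous sibling tag, you call 'previous_sibling' twice on the p tag, since the first call typically lands on the whitespace text between the two tags.
Since the h2 is the second sibling tag before the p, that makes four calls in total, and you can reach the h2 (which holds the 'History' header) like this: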
paragraphs = html.select("p")
for para in paragraphs:
    px = para.text
    if px.find(Search) > -1:
        print(para.text)
        # step back past the intervening siblings to reach the h2
        print(para.previous_sibling.previous_sibling.previous_sibling.previous_sibling.text)
This prints:
History
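Counting 'previous_sibling' calls depends on exactly how many nodes sit between the heading and the paragraph, so it can break when the markup changes. As an alternative, not the approach used in the answer above, Beautiful Soup's `find_previous` searches backwards through the document for the nearest matching tag. The sketch below assumes a made-up `SAMPLE_HTML` snippet and a hypothetical helper named `locate`; it returns the title, header, and paragraph together.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<p>Mathematics is an area of knowledge.</p>
<h2>History</h2>
<div class="hatnote">Main article: History of mathematics</div>
<p>The first abstraction was probably that of numbers.</p>
"""


def locate(html_text, search):
    """Return (title, header, paragraph) for the first <p> containing `search`."""
    soup = BeautifulSoup(html_text, "html.parser")
    title = soup.find(id="firstHeading").get_text(strip=True)
    for para in soup.find_all("p"):
        if search in para.get_text():
            # Nearest <h2> that appears before this paragraph in the document.
            heading = para.find_previous("h2")
            header = heading.get_text(strip=True) if heading else None
            return title, header, para.get_text(strip=True)
    return None  # no paragraph matched


result = locate(SAMPLE_HTML, "The first abstraction")
print(result)
```

Because `find_previous` walks back through the whole document rather than only through siblings, it should also cope with a heading that is nested inside a wrapper element.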