How can I show the title, header, and paragraph in the same order as the original web page? - Python

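I am parsing a Wikipedia page. I want to search for a keyword, for example "The first abstraction", and display the title, the header, and the matching paragraph. How can I do that?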

Web :https://en.wikipedia.org/wiki/Mathematics
Search = "The first abstraction"
Output :
       title: Mathematics
       header: History
       paragraph : The history of mathematics can be seen as an ever-increasing series of   
                   abstractions. **The first abstraction**, which is shared by many animals,[14] was 
                   probably that of numbers: the realization that a collection of two apples and a            
                   collection of two oranges (for example) have something in common, namely quantity 
                   of their members. 
import bs4
import requests

response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

if response is not None:
    html = bs4.BeautifulSoup(response.text, 'html.parser')

    title = html.select("#firstHeading")[0].text
    print(title)
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)

# just grab the text up to contents as stated in question
intro = '\n'.join([para.text for para in paragraphs[0:5]])
print(intro)

This code displays the title fine, but the headers and paragraphs do not come out in order, so I cannot match them up. Thanks.

First, while looping over the p tags, you need to search for "The first abstraction", because you only want the paragraph that contains "The first abstraction".

So call find() on the text of each 'para' to check whether the expected text is present:

paragraphs = html.select("p")

Search = "The first abstraction" # expected text

for para in paragraphs:
    px = para.text
    if px.find(Search)>-1:
        print (para.text)

This gives you the expected paragraph:

The history of mathematics can be seen as an ever-increasing series of abstractions. The first abstraction, which is shared by many animals,[14] was probably that of numbers: the realization that a collection of two apples and a collection of two oranges (for example) have something in common, namely quantity of their members.
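The paragraph search can also be tried offline. The following is a minimal sketch, not the asker's exact setup: SAMPLE_HTML is a made-up stand-in for the downloaded page, and find_paragraphs is a hypothetical helper name. It uses the `in` operator, which is equivalent to the find() check for this purpose.

```python
import bs4

# Hypothetical stand-in for response.text, so the sketch runs without a network call.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<p>Mathematics includes the study of numbers.</p>
<h2>History</h2>
<p>The history of mathematics is a series of abstractions. The first abstraction was probably that of numbers.</p>
"""


def find_paragraphs(html_text, search):
    """Return the text of every <p> element that contains the search string."""
    soup = bs4.BeautifulSoup(html_text, "html.parser")
    return [p.get_text() for p in soup.select("p") if search in p.get_text()]


matches = find_paragraphs(SAMPLE_HTML, "The first abstraction")
for text in matches:
    print(text)
```

Note that the match is case-sensitive and only works when the phrase is not split differently in the page text, for example by unusual whitespace.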

Now the title and the paragraph are done. You still need to extract the header. Look at the HTML structure of the page you are parsing (this always helps).

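In the page's HTML, the h2 tag is a sibling of the p tag (where your text is located). Sibling navigation is described in the BeautifulSoup documentation.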

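So to move back to the previous sibling tag, you should call 'previous_sibling' twice on the p tag: the first call lands on the whitespace text node between the two tags, and the second call reaches the tag itself.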
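Why one call is not enough can be seen on a tiny fragment. This is a sketch on made-up markup (SNIPPET is an assumption, not the real Wikipedia source): the newline between the two tags is parsed as its own text node, and it sits between h2 and p in the sibling chain.

```python
import bs4

# Hypothetical two-tag fragment; the "\n" between the tags becomes a text node.
SNIPPET = "<h2>History</h2>\n<p>The first abstraction ...</p>"

soup = bs4.BeautifulSoup(SNIPPET, "html.parser")
p = soup.find("p")

first_hop = p.previous_sibling            # the whitespace text node, not a tag
second_hop = first_hop.previous_sibling   # the h2 tag

print(type(first_hop).__name__)
print(second_hop.name, second_hop.get_text())
```

If the markup has no whitespace between the tags, a single call already reaches the tag, so counting calls by hand depends on the exact source of the page.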

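Since h2 is two sibling tags before p (two 'previous_sibling' calls per tag, four in total), you can access h2 (which holds the 'History' header) like this: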

paragraphs = html.select("p")
for para in paragraphs:
    px = para.text
    if px.find(Search)>-1:
        print (para.text)
        print(para.previous_sibling.previous_sibling.previous_sibling.previous_sibling.text)

This prints:

History
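Chaining a fixed number of previous_sibling calls only works while exactly that many nodes sit between the heading and the paragraph. A less fragile variant, offered here as a sketch rather than as part of the original answer, is to ask BeautifulSoup for the nearest preceding h2 with find_previous. SAMPLE_HTML and find_with_header are hypothetical names, and the markup is a simplified stand-in for the real page.

```python
import bs4

# Hypothetical stand-in for the downloaded page; the matching paragraph is
# deliberately not the first paragraph after the heading.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<p>Intro paragraph.</p>
<h2>History</h2>
<p>Some earlier paragraph in the section.</p>
<p>The first abstraction was probably that of numbers.</p>
"""


def find_with_header(html_text, search):
    """Return (title, header, paragraph) for every paragraph containing search."""
    soup = bs4.BeautifulSoup(html_text, "html.parser")
    title = soup.select_one("#firstHeading").get_text()
    results = []
    for para in soup.select("p"):
        if search in para.get_text():
            # Nearest h2 that appears before this paragraph in document order.
            heading = para.find_previous("h2")
            header = heading.get_text() if heading is not None else None
            results.append((title, header, para.get_text()))
    return results


results = find_with_header(SAMPLE_HTML, "The first abstraction")
for title, header, paragraph in results:
    print("title:", title)
    print("header:", header)
    print("paragraph:", paragraph)
```

Because find_previous walks backwards through the document rather than through direct siblings, it should also cope with headings that are wrapped in another element. A paragraph that appears before any h2 gets a header of None.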