How to use find_all() to scrape a dictionary website?

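I've run into a strange problem. I'm using BeautifulSoup to scrape a dictionary website for definitions and their parts of speech, and I have to scrape them in the correct order so that each part of speech is matched with the right definition.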

For example, for 'ape', the definition 'A large primate' must go with the noun, while 'mimic' must go with the verb. For Merriam-Webster's website, I used:

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.merriam-webster.com/dictionary/'
word = 'ape'

results = requests.get(url + word)
src = results.content
soup = bs(src, 'lxml')

text = soup.find_all(class_= ['num', 'letter', 'dtText', 'sdsense', 'important-blue-link'])

for tag in text:
    print(tag.text.strip())

This works well. For each div with class = 'num', 'letter', and so on, it picks out the right elements, and print(tag.text.strip()) returns the text inside them.

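Unfortunately, MW's formatting is a nightmare (notice that there are far more class tags than parts of speech and definitions), and the definitions are wordier than what I'm looking for, so I went to dictionary.com. Dictionary.com has simpler formatting and better definitions for my purposes, so I was happy. The problem appears when I try to pass multiple classes to the find_all function. If I run: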

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.dictionary.com/browse/'
word = 'ape'

results = requests.get(url + word, verify = False)
src = results.content
soup = bs(src, 'lxml')

text = soup.find_all(class_ = 'one-click-content css-nnyc96 e1q3nk1v1')

for tag in text:
    print(tag.text.strip())

I get all my definitions just fine, and if I run the same code with

text = soup.find_all(class_ = 'luna-pos')

I get all my parts of speech just fine, but if I run the code

text = soup.find_all(class_ = ['luna-pos','one-click-content css-nnyc96 e1q3nk1v1'])

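the text variable comes back as just an empty list. I don't understand why this way of passing multiple classes to find_all() works for one website but not for the other. The only thing I can think of is that requests.get() couldn't find a certificate for dictionary.com, so I added verify = False, which produces a warning, but I can't see why that would affect find_all().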

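Not sure why the two parts need to be combined, but you can reach your goal by listing single class names, either with find_all() or with a CSS selector: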

soup.find_all(class_ = ['luna-pos','one-click-content'])

soup.select('.luna-pos,.one-click-content')
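
To compare the two alternatives without hitting the network, here is a minimal sketch. The HTML is a made-up snippet that only borrows the class names from the question (an assumption, not the real dictionary.com markup), and html.parser is used so that lxml is not required. Both calls should return the matching tags in document order, which is what keeps each part of speech next to its definitions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names from the question;
# it is not the real dictionary.com page.
html = """
<section>
  <span class="luna-pos">noun</span>
  <div class="one-click-content css-nnyc96 e1q3nk1v1">a large primate</div>
</section>
<section>
  <span class="luna-pos">verb</span>
  <div class="one-click-content css-nnyc96 e1q3nk1v1">to mimic</div>
</section>
"""

soup = BeautifulSoup(html, 'html.parser')

# Each list item is a single class name; a tag matches if it carries any of them.
by_list = [tag.get_text(strip=True)
           for tag in soup.find_all(class_=['luna-pos', 'one-click-content'])]

# The same selection written as a CSS selector group.
by_css = [tag.get_text(strip=True)
          for tag in soup.select('.luna-pos, .one-click-content')]

print(by_list)
print(by_css)
```

With this snippet both lists come out as ['noun', 'a large primate', 'verb', 'to mimic'], so the generated class names (css-nnyc96, e1q3nk1v1) are not needed to pick out the elements.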

In case you want separate, more structured output, change your strategy for selecting elements:

data = []
for e in soup.select('#top-definitions-section ~ section'):
    data.append({
        'pos':e.select_one('.luna-pos').text,
        'definition':[t.get_text(strip=True) for t in e.select('div[value]')]
    })

data

Output:

[{'pos': 'noun',
  'definition': ['Anthropology,Zoology.any member of the superfamily Hominoidea, the two extant branches of which are the lesser apes (gibbons) and the great apes (humans, chimpanzees, gorillas, and orangutans).See alsocatarrhine.',
   '(loosely) any primate except humans.',
   'an imitator;mimic.',
   'Informal.a big, ugly, clumsy person.',
   'Disparaging and Offensive.(used as a slur against a member of a racial or ethnic minority group, especially a Black person.)']},
 {'pos': 'verb (used with object),',
  'definition': ["toimitate;mimic:to ape another's style of writing."]},
 {'pos': 'adjective',
  'definition': ['Slang. (usually in the phrasego ape)violently emotional:When she threatened to leave him, he went ape.extremely enthusiastic (often followed byoverorfor):They go ape over old rock music.We were all ape for the new movie trailer.']}]
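
If the goal is one part of speech per definition, as described in the question, the structured result can be flattened into pairs. This is only a sketch: the data list below is an abbreviated copy of the output above, and stripping the trailing comma from the part of speech is an assumption about what is wanted, not something the page requires.

```python
# Abbreviated sample shaped like the `data` list above
# (only a few of the definitions are kept for illustration).
data = [
    {'pos': 'noun',
     'definition': ['(loosely) any primate except humans.',
                    'an imitator;mimic.']},
    {'pos': 'verb (used with object),',
     'definition': ["toimitate;mimic:to ape another's style of writing."]},
]

# One (part of speech, definition) tuple per definition, in the original order.
pairs = [
    (entry['pos'].rstrip(','), definition)
    for entry in data
    for definition in entry['definition']
]

for pos, definition in pairs:
    print(f'{pos}: {definition}')
```

Keeping the pairs as tuples preserves the order of the page while making it impossible for a definition to drift away from its part of speech.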