Selenium python: get all the <li> text of all the <ul> from a <div>

I want to get a list of all the words, in the form dutch word = english word, from several pages.

From inspecting the HTML, this means I need to get all the text of all the li of all the ul from the child div of a div.

Here is my code:

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window
driver = webdriver.Chrome(chrome_options=options)

listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_tag_name("li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

This is the output:

['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

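I don't understand why the text of some li elements is not retrieved even though their XPath is the same (I double-checked several of them with the developer console's Copy XPath).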

Try waiting until the page has fully loaded before parsing it. One way is to use time.sleep():

from time import sleep
...

for url in listURL:
    driver.get(url)
    sleep(5)
    ...
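
A fixed sleep is either longer than needed or, on a slow load, too short. As a rough sketch of the alternative, the helper below polls a condition until it holds or a timeout expires. The name wait_until and its parameters are hypothetical and not part of Selenium; Selenium's own WebDriverWait works on the same polling principle.

```python
import time


def wait_until(predicate, timeout=15.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical usage inside the loop, replacing sleep(5):
# items = wait_until(lambda: driver.find_elements("css selector", ".mw-parser-output > ul li"))
```

The loop checks the condition at least once even with a zero timeout, and returns as soon as the page is ready instead of always paying the full delay.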

Edit: using BeautifulSoup:

import requests
from bs4 import BeautifulSoup


listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    
    for tag in soup.select("[id*=Lesson]:not([id*=Lessons])"):
        print(tag.text)
        print()
        print(tag.find_next("ul").text)
        print("-" * 80)
    print()

Output (truncated):

Link: https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1
Lesson 1

man = man
vrouw = woman
jongen = boy
ik = I
ben = am
een = a/an
en = and
--------------------------------------------------------------------------------
Lesson 2

meisje = girl
kind = child/kid
hij = he
ze = she (unstressed)
is = is
of = or
--------------------------------------------------------------------------------
Lesson 3

appel = apple

... and so on

If you want the output as a list:

for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    print([tag.text for tag in soup.select(".mw-parser-output > ul li")])
    print("-" * 80)
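
Since the goal in the question is a list of dutch word = english word entries, the scraped strings can be split into pairs afterwards. The following is a minimal sketch, assuming every useful entry contains " = "; the function name parse_word_pairs and the sample list are made up for illustration.

```python
def parse_word_pairs(items):
    """Turn strings like 'man = man' into (dutch, english) tuples."""
    pairs = []
    for item in items:
        # partition splits on the first " = " only, so the English side stays intact
        dutch, sep, english = item.partition(" = ")
        # skip empty strings and entries without a separator
        if sep and dutch.strip() and english.strip():
            pairs.append((dutch.strip(), english.strip()))
    return pairs


sample = ["man = man", "een = a/an", "", "Zijn (to be)", "ze = she (unstressed)"]
print(parse_word_pairs(sample))
# prints [('man', 'man'), ('een', 'a/an'), ('ze', 'she (unstressed)')]
```

A list of tuples is used rather than a dict because the same Dutch word can appear in more than one lesson with a different translation, and a dict would silently keep only the last one.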

Your script looks fine, but I would add an explicit or implicit wait. Try waiting until all the elements on the page are visible:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window

driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))
    # find_elements(By..., ...) replaces the find_elements_by_* helpers removed in Selenium 4
    elem = driver.find_elements(By.XPATH, '//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements(By.TAG_NAME, "li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

Alternatively, you can add driver.implicitly_wait(15) right after declaring the driver.

Output:

['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', 'meisje = girl', 'kind = child/kid', 'hij = he', 'ze = she (unstressed)', 'is = is', 'of = or', 'appel = apple', 'melk = milk', 'drinkt = drinks (2nd and 3rd person singular)', 'drink = drink (1st person singular)', 'eet = eat(s) (singular)', 'de = the', 'sap = juice', 'water = water', 'brood = bread', 'het = it, the', 'je = you (singular informal, unstressed)', 'bent = are (2nd person singular)', 'Zijn (to be)', 'Hebben (to have)', 'Mogen (to be allowed to)', 'Willen (to want)', 'Kunnen (to be able to)', 'Zullen ("will")', 'boterham = sandwich', 'rijst = rice', 'we = we (unstressed)', 'jullie = you (plural informal)', 'eten = eat (plural)', 'drinken = drink (plural)', 'vrouwen = women', 'mannen = men', 'meisjes = girls', 'krant = newspaper', 'lezen = read (plural)', 'jongens = boys', 'menu = menu', 'dat = that', 'zijn = are (plural)', 'ze = they (unstressed)', 'heb = have (1st person singular)', 'heeft = has (3rd person singular)', 'hebt = have (2nd person singular)', 'hebben = have (plural)', 'boek = book', 'lees = read (1st person singular)', 'leest = read(s) (2nd and 3rd person singular)', 'kinderen = children', 'spreken = speak (plural)', 'spreek = speak (1st person singular)', 'spreekt = speak(s) (2nd and 3rd person singular)', 'hallo = hello', 'bedankt = thanks', 'doei = bye', 'dag = goodbye', 'tot ziens = see you later', 'hoi = hi', 'goedemorgen = good morning', 'goededag = good day', 'goedenavond = good evening', 'goedenacht = good night', 'welterusten = good night', 'ja = yes', 'dank je wel = thank you very much', 'alsjeblieft = please', 'sorry = sorry', 'het spijt me = I am sorry', 'oké = okay', 'pardon = excuse me', 'hoe gaat het = how are you', 'goed = good, fine, well', 'dank je = thank you', '(een) beetje = (a) bit of', 'Engels = English', 'Nederlands = Dutch', 'Geen: negating indefinite nouns (you can think of it as "no" things or "none of" a thing if that 
helps). Geen replaces the indefinite pronoun in question.', 'Niet: negating a verb, adjective or definite nouns. Niet comes at the end of a sentence or directly after the verb zijn.', 'nee = no', 'niet = not', 'geen = not']

Update: I found a more reliable approach using CSS selectors. Please try it:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window

driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
driver.implicitly_wait(15)
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    wait = WebDriverWait(driver, 15)
    # wait for the ad iframes first, since they take up most of the loading time
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe'] ")))
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.mw-parser-output>ul')))
    elem = driver.find_elements(By.CSS_SELECTOR, '.mw-parser-output>ul')
    for each_ul in elem:
        all_li = each_ul.find_elements(By.CSS_SELECTOR, "li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

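Update 2: After trying to understand the cause, I found that the ads take up most of the loading time. So I added wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe'] "))) to wait until all the ads have loaded.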

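I also changed the second wait to .mw-parser-output>ul by removing the trailing li, which I don't think is needed. You can also try removing the second wait and see if that helps.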

After

WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))

you need to add a short sleep (I think time.sleep(1) would be enough) before doing

elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')

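Your problem is caused by a misunderstanding of what visibility_of_all_elements_located does.
It does not actually wait for all the elements matched by the locator you pass to become visible; it has no way of knowing how many such elements to wait for.
So as soon as it detects at least one visible element matching your locator, it returns the list of detected elements and the program moves on.
See more details about these methods and the official documentation.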
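
One way to act on this is to tell the wait how many items to expect and to require that each one already has text. WebDriverWait.until accepts any callable that takes the driver and returns a truthy value when ready, so a custom condition can be sketched as below. The name texts_when_ready and the min_count parameter are hypothetical, and the expected count has to come from your own knowledge of the page.

```python
def texts_when_ready(locator, min_count=1):
    """Build a wait condition that is truthy only once enough items have text.

    locator is a (by, value) tuple such as ("css selector", "ul li").
    """
    def _predicate(driver):
        elements = driver.find_elements(*locator)
        texts = [element.text for element in elements]
        # keep waiting while there are too few items or any item is still blank
        if len(texts) < min_count or not all(t.strip() for t in texts):
            return False
        return texts
    return _predicate


# Hypothetical usage with a real driver:
# texts = WebDriverWait(driver, 15).until(
#     texts_when_ready((By.CSS_SELECTOR, ".mw-parser-output > ul li"), min_count=7))
```

Note that this condition times out if some li elements are legitimately empty or never become visible, so it only fits pages where every matched item is expected to show text.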