Python not getting text between html tags
It looks like Python can't find the text because it is marked display=none. What should I do to work around this?
Here is my code:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.domcop.com/domains/great-expired-domains/')
soup = BeautifulSoup(r.text, 'html.parser')
data = soup.find('div', {'id':'all-domains'})
data.text
The code returns []
I also tried XPath:
from lxml import etree
data = etree.HTML(r.text)
anchor = data.xpath('//div[@id="all-domains"]/text()')
It returns the same thing...
Yes, the element with id="all-domains" is empty, because it is populated dynamically by JavaScript executed in the browser. With requests you only get the initial HTML page, without the "dynamic" part, so to speak. To get all the domains, I would simply iterate over the table rows and extract the domain link text. Working sample:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.domcop.com/domains/great-expired-domains/',
headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"})
soup = BeautifulSoup(r.text, 'html.parser')
for domain in soup.select("tbody#domcop-table-body tr td a.domain-link"):
    print(domain.get_text())
Prints:
u2tourfans.com
tvadsview.com
gfanatic.com
blucigs.com
...
twply.com
sweethomeparis.com
vvchart.com
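To see why the CSS selector above matches the domain links, here is a minimal, self-contained sketch run against a hand-written HTML snippet that mimics the table's structure (the snippet itself is invented for illustration, not copied from the site):

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating the structure of the domain table
html = """
<table>
  <tbody id="domcop-table-body">
    <tr><td><a class="domain-link" href="#">example-one.com</a></td></tr>
    <tr><td><a class="domain-link" href="#">example-two.com</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "tbody#domcop-table-body tr td a.domain-link" matches every <a> element
# with class "domain-link" inside a <td> of a <tr> under that tbody
domains = [a.get_text() for a in
           soup.select("tbody#domcop-table-body tr td a.domain-link")]
print(domains)  # ['example-one.com', 'example-two.com']
```

Because this table markup is present in the initial HTML response (unlike the `all-domains` div, which is filled in by JavaScript), the same selector works on the page fetched with requests.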