How to extract href from an <a> element using lxml cssselect?
    import lxml.html
    from lxml.cssselect import CSSSelector

    def extract_page_data(html):
        tree = lxml.html.fromstring(html)
        item_sel = CSSSelector('.my-item')
        text_sel = CSSSelector('.my-text-content')
        time_sel = CSSSelector('.time')
        author_sel = CSSSelector('.author-text')
        a_tag = CSSSelector('.a')
        for item in item_sel(tree):
            # text_content() returns the link text, not the href attribute
            yield {'href': a_tag(item)[0].text_content(),
                   'my pagetext': text_sel(item)[0].text_content(),
                   'time': time_sel(item)[0].text_content().strip(),
                   'author': author_sel(item)[0].text_content()}
I want to extract the href, but I can't get it with this code.
Try reading the href attribute from the element instead of its text content:

    'href': a_tag(item)[0].attrib['href']

or

    'href': a_tag(item)[0].get('href')

Alternatively, you can use XPath:

    tree.xpath(".//a/@href")