BeautifulSoup getting content of < > tags
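I have a set of scraped pages that I have to work with (I cannot scrape them again), and they contain the meta information in quoted < > tags, like this: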
...
<span class="html-tag">
<meta <span class="html-attribute-name">name</span>="
<span class="html-attribute-value">twitter:title</span>"
<span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:title</span>"
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">Smart TV wifi won't turn on</span>" />
...
<meta <span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:url</span>"
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">
https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on</span>" />
...
Update 3:
Loaded in Chrome, these lines look like this:
<meta name="twitter:title" property="og:title" content="Smart TV wifi won't turn on" />
<meta property="og:url" content="https://x.y.org/discussion/437/lg-smart-tv-wifi-wont-turn-on" />
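But in the raw scraped text these are not real <meta> tags; they appear in the quoted, span-wrapped <meta .... > form shown above.
Is it possible to use BeautifulSoup to get the content from such quoted <meta .... > tags? In this case, I need to get "Smart TV wifi won't turn on" and the URL "https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on".
How can I do this?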
from bs4 import BeautifulSoup
html = """ ...
<span class="html-tag">
<meta <span class="html-attribute-name">name</span>="
<span class="html-attribute-value">twitter:title</span>"
<span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:title</span>"
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">Smart TV wifi won't turn on</span>" />
...
"""
soup = BeautifulSoup(html, 'html.parser')
# Take the third attribute-value span (the "content" value) and print its children.
for item in soup.findAll("span", {'class': 'html-attribute-value'})[2]:
    print(item)
Update:
from bs4 import BeautifulSoup
import re
html = """<meta name="twitter:title" property="og:title" content="Smart TV wifi won't turn on" />
<meta property="og:url" content="https://x.y.org/discussion/437/lg-smart-tv-wifi-wont-turn-on" />"""
soup = BeautifulSoup(html, 'html.parser')
# Select every <meta> tag whose property starts with "og" and print its content attribute.
for item in soup.findAll("meta", property=re.compile("^og")):
    print(item.get("content"))
Output:
Smart TV wifi won't turn on
https://x.y.org/discussion/437/lg-smart-tv-wifi-wont-turn-on
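The snippet above works on real <meta> tags. For the quoted form in the question, one possible approach is to parse twice: the text of the scraped page is the original markup, so it can be handed to BeautifulSoup again. This is only a sketch. It assumes the scraped file stores the tags as entities (&lt;meta ... /&gt;), which the question does not show explicitly, and the helper name og_meta and the sample string are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumed input: view-source style markup in which the original tags are
# stored as entities (&lt;meta ... /&gt;) and wrapped in highlighting spans.
scraped = (
    '<span class="html-tag">&lt;meta '
    '<span class="html-attribute-name">property</span>="'
    '<span class="html-attribute-value">og:title</span>" '
    '<span class="html-attribute-name">content</span>="'
    '<span class="html-attribute-value">Smart TV wifi won\'t turn on</span>" /&gt;</span>\n'
    '<span class="html-tag">&lt;meta '
    '<span class="html-attribute-name">property</span>="'
    '<span class="html-attribute-value">og:url</span>" '
    '<span class="html-attribute-name">content</span>="'
    '<span class="html-attribute-value">\n'
    'https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on</span>" /&gt;</span>\n'
)


def og_meta(scraped_html):
    """Hypothetical helper: map each og:* property to its content value."""
    # First pass: drop the highlighting spans and decode the entities,
    # which leaves the original page source as plain text.
    source = BeautifulSoup(scraped_html, "html.parser").get_text()
    # Second pass: parse that recovered source as ordinary HTML.
    soup = BeautifulSoup(source, "html.parser")
    result = {}
    for meta in soup.find_all("meta"):
        # Strip stray whitespace left over from line breaks inside the spans.
        prop = (meta.get("property") or "").strip()
        if prop.startswith("og:"):
            result[prop] = (meta.get("content") or "").strip()
    return result


og = og_meta(scraped)
print(og["og:title"])
print(og["og:url"])
```

The values are stripped because the scraped markup in the question has a line break inside at least one attribute-value span.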
I'm not sure whether this is what you want.
from simplified_scrapy import SimplifiedDoc
html = '''
<span class="html-tag">
<meta <span class="html-attribute-name">name</span>="
<span class="html-attribute-value">twitter:title</span>"
<span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:title</span>"
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">Smart TV wifi won't turn on</span>" />
...
<meta <span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:url</span>"
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">
https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on</span>" />
'''
doc = SimplifiedDoc(html)
block = doc.getSectionByReg(r'<meta[\s\S]+?/>')  # Get the first data block.
span = SimplifiedDoc(block).getElementByText('content').next.text
print(span)
blocks = doc.getSectionsByReg(r'<meta[\s\S]+?/>')  # Get all data blocks.
for block in blocks:
    # The value span follows the span whose text is "content".
    span = SimplifiedDoc(block).getElementByText('content').next.text
    print(span)
Result:
Smart TV wifi won't turn on
Smart TV wifi won't turn on
https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on
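The same idea (find the span whose text is content, then read the value span that follows it) can probably be expressed in BeautifulSoup as well. The sketch below is not part of the answer above. It assumes the opening of each quoted tag is stored as an entity (&lt;meta) so that html.parser sees well-formed spans, and the sample string is illustrative.

```python
from bs4 import BeautifulSoup

# Assumed input: quoted tags stored as entities and wrapped in
# html-attribute-name / html-attribute-value highlighting spans.
scraped = (
    '<span class="html-tag">&lt;meta '
    '<span class="html-attribute-name">property</span>="'
    '<span class="html-attribute-value">og:title</span>" '
    '<span class="html-attribute-name">content</span>="'
    '<span class="html-attribute-value">Smart TV wifi won\'t turn on</span>" /&gt;</span>\n'
    '<span class="html-tag">&lt;meta '
    '<span class="html-attribute-name">property</span>="'
    '<span class="html-attribute-value">og:url</span>" '
    '<span class="html-attribute-name">content</span>="'
    '<span class="html-attribute-value">\n'
    'https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on</span>" /&gt;</span>\n'
)

soup = BeautifulSoup(scraped, "html.parser")

contents = []
for name in soup.find_all("span", class_="html-attribute-name"):
    # Only attribute names equal to "content" are of interest.
    if name.get_text(strip=True) != "content":
        continue
    # The matching value is the next html-attribute-value span in document order.
    value = name.find_next("span", class_="html-attribute-value")
    if value is not None:
        contents.append(value.get_text(strip=True))

for text in contents:
    print(text)
```

This version collects every content value, not only those on og:* tags; filtering by the preceding property value would need an extra check.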