Capturing data between specific tags in Python
I am fetching the content of a URL in Python, and I want to capture everything between <h1> and </h1>.
What I tried:
myString='''<h1>kgkgjgjgkjgkjgkj</h1>
<h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
dsfgdfgg
<h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
dfgdffdgf
<h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
dfgdfgdg
<h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
'''
if '<h1>' in myString:
    startString = '<h1>'
    endString = '</h1>'
    # find() returns only the first match, so this prints only the first heading
    print(myString[myString.find(startString) + len(startString):myString.find(endString)])
I have multiple h1 tags, but this only captures the data inside the first one.
How can I capture the data between all h1 tags?
You can use a simple regular expression:
import re
print(re.findall(r'<h1>(.*?)</h1>', myString))
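The pattern above matches only a bare, lowercase <h1> whose content sits on one line. A minimal sketch of a slightly more tolerant pattern follows; the sample string and the name H1_PATTERN are made up for illustration, and a regular expression is still not a full HTML parser.

```python
import re

# Hypothetical sample input: one heading spans two lines and uses an
# attribute and uppercase tag names.
html = """<h1>first</h1>
<H1 class="title">second
line</H1>
<h1>third</h1>"""

# <h1\b[^>]*> tolerates attributes on the opening tag;
# re.DOTALL lets "." cross newlines; re.IGNORECASE accepts <H1> too.
H1_PATTERN = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.DOTALL | re.IGNORECASE)

headings = [text.strip() for text in H1_PATTERN.findall(html)]
print(headings)  # ['first', 'second\nline', 'third']
```

The non-greedy `(.*?)` matters here: a greedy `(.*)` combined with re.DOTALL would run from the first <h1> to the last </h1>.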
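Another option is to use Beautiful Soup as an HTML parser (this is the preferred approach if you want to parse real-world HTML pages):
from bs4 import BeautifulSoup
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(myString, 'html.parser')
print([h1.string for h1 in soup.find_all('h1')])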
BeautifulSoup is not part of the standard library, so you need to install it yourself. You can do that easily with pip:
pip install beautifulsoup4
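If installing a third-party package is not an option, the standard library's html.parser module can do the same job with a little more code. This is a rough sketch for Python 3, not something from the answer above; the class name H1Collector and the helper extract_h1 are hypothetical.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collects the text content of every <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headings = []   # finished heading texts
        self._parts = None   # text pieces of the currently open <h1>, else None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase
        if tag == "h1":
            self._parts = []

    def handle_data(self, data):
        # Only keep text while we are inside an <h1>
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._parts is not None:
            self.headings.append("".join(self._parts))
            self._parts = None


def extract_h1(html):
    parser = H1Collector()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = "<h1>one</h1>\ntext\n<h1>two <b>bold</b></h1>"
print(extract_h1(sample))  # ['one', 'two bold']
```

Text inside nested tags such as <b> is kept because handle_data fires for every text node while an <h1> is open.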
Use the BeautifulSoup parser.
>>> from bs4 import BeautifulSoup
>>> myString='''<h1>kgkgjgjgkjgkjgkj</h1>
... <h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
... dsfgdfgg
... <h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
... dfgdffdgf
... <h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
... dfgdfgdg
... <h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
... '''
>>> soup = BeautifulSoup(myString, 'html.parser')
>>> h1 = soup.select('h1')
>>> for i in h1:
...     print(i.text)
...
kgkgjgjgkjgkjgkj
kdfgggggggggggggggggggkgjgjgkjgkjgkj
kgkgjgjgkdfgdfgdgdfjgkjgkj
kgkgjgjsdssssssssssssssssssssgkjgkjgkj
kgkgjgjgkjgkjgkgggggggggggggggggggj
>>>
A working example with Beautiful Soup:
>>> import bs4
>>> myString='''<h1>kgkgjgjgkjgkjgkj</h1>
... <h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
... dsfgdfgg
... <h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
... dfgdffdgf
... <h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
... dfgdfgdg
... <h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
... '''
>>> soup = bs4.BeautifulSoup(myString, 'html.parser')
>>> soup.find("h1").text
u'kgkgjgjgkjgkjgkj'
>>> soup.find_all("h1")
[<h1>kgkgjgjgkjgkjgkj</h1>, <h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>, <h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>, <h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>, <h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>]
A simple list comprehension solution:
print([s.split('</h1>')[0] for s in myString.split('<h1>')[1:]])
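The split-based one-liner also returns whatever follows an opening <h1> that is never closed. As a sketch of an alternative, the find()-based attempt from the question can be turned into a loop by passing a start offset to str.find; the function name between_all and the sample string are made up for illustration.

```python
def between_all(text, start, end):
    """Return every substring of text found between start and end markers."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)      # next opening marker at or after pos
        if begin == -1:
            break
        begin += len(start)                # skip past the marker itself
        finish = text.find(end, begin)     # matching closing marker
        if finish == -1:
            break                          # unclosed marker: stop rather than guess
        results.append(text[begin:finish])
        pos = finish + len(end)            # continue after the closing marker
    return results


sample = "<h1>a</h1>\nxx\n<h1>b</h1>\n<h1>unclosed"
print(between_all(sample, "<h1>", "</h1>"))  # ['a', 'b']
```

Like the split version, this is plain string matching: it does not handle attributes or nested tags, so a real parser remains the safer choice for arbitrary pages.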
I would go with BeautifulSoup. Here is my attempt:
from bs4 import BeautifulSoup
import requests
url = 'http://accessibility.psu.edu/headingshtml/'
response = requests.get(url).content
# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(response, 'lxml')
h1tags = soup.find_all('h1')
for singleTag in h1tags:
    print(singleTag.text)
This prints (in this case the page has only one h1 tag):
Heading Tags (H1, H2, H3, P) in HTML