Get data from an HTML page into a Python array
I just want to get some data from a web page like this:
[ . . . ]
<p class="special-large">Lorem Ipsum 01</p>
<p class="special-large">Lorem Ipsum 02</p>
<p class="special-large">Lorem Ipsum 03</p>
<p class="special-large">Lorem Ipsum 04</p>
<p class="special-large">Lorem Ipsum 05</p>
[ . . . ]
I want a Python array (list) like this:
myArrayWebPage = ["Lorem Ipsum 01","Lorem Ipsum 02","Lorem Ipsum 03","Lorem Ipsum 04","Lorem Ipsum 05"]
This is my Python script:
import urllib.request
urlAddress = "http:// ... /" # my url address
getPage = urllib.request.urlopen(urlAddress)
outputPage = getPage.read()
print(outputPage)
How can I get the array from "outputPage"?
This seems to do what you want:
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> html = '''<p class="special-large">Lorem Ipsum 01</p>
<p class="special-large">Lorem Ipsum 02</p>
<p class="special-large">Lorem Ipsum 03</p>
<p class="special-large">Lorem Ipsum 04</p>
<p class="special-large">Lorem Ipsum 05</p>'''
>>> import re
>>> re.findall('<p class="special-large">([^<]+)</p>', html)
['Lorem Ipsum 01', 'Lorem Ipsum 02', 'Lorem Ipsum 03', 'Lorem Ipsum 04', 'Lorem Ipsum 05']
>>>
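One step the session above skips: in the question's script, `getPage.read()` returns `bytes`, while the pattern is a `str`, so the page has to be decoded before `re.findall` will accept it. A minimal sketch, assuming the page is UTF-8 encoded (the real page may declare a different charset); `extract_special_large` is a hypothetical helper name, and the byte string below stands in for what `getPage.read()` would return:

```python
import re

# Hypothetical helper; the UTF-8 default is an assumption about the page.
def extract_special_large(page_bytes, encoding="utf-8"):
    """Return the text of every <p class="special-large"> element."""
    html = page_bytes.decode(encoding)
    return re.findall(r'<p class="special-large">([^<]+)</p>', html)

# Stand-in for the bytes returned by getPage.read() in the question.
outputPage = (
    b'<p class="special-large">Lorem Ipsum 01</p>\n'
    b'<p class="special-large">Lorem Ipsum 02</p>'
)
myArrayWebPage = extract_special_large(outputPage)
print(myArrayWebPage)
```

With the real script, the same call would be `extract_special_large(outputPage)` right after `outputPage = getPage.read()`.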
Note that regular expressions are typically not preferred for something like this. You should use an HTML parsing library like Beautiful Soup.
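Beautiful Soup is a third-party package, so as a rough illustration of the parser-based approach this sketch uses the standard library's `html.parser` instead (a deliberate swap, not Beautiful Soup's API). The class name `SpecialLargeParser` and the sample markup are assumptions made for the example:

```python
from html.parser import HTMLParser

class SpecialLargeParser(HTMLParser):
    """Collect the text of every <p> whose class list contains 'special-large'."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # None while outside a matching <p>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "special-large" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.items.append("".join(self._buffer).strip())
            self._buffer = None

# Sample markup modelled on the question; the "other" paragraph is skipped.
html = '''<p class="special-large">Lorem Ipsum 01</p>
<p class="other">not wanted</p>
<p class="special-large">Lorem Ipsum 02</p>'''

parser = SpecialLargeParser()
parser.feed(html)
parser.close()
print(parser.items)
```

Unlike the regular expression, this does not depend on the exact attribute order or quoting in the tag, which is the usual reason a parser is preferred.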