Reading a Text File From a Web Page With Python 3
import re
import urllib.request
hand = urllib.request.urlopen("http://www.pythonlearn.com/code/mbox-short.txt")
qq = hand.read().decode('utf-8')
numlist = []
for line in qq:
    line.rstrip()
    stuff = re.findall("^X-DSPAM-Confidence: ([0-9.]+)", line)
    if len(stuff) != 1:
        continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))
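The variable qq contains the full text of the file. However, the for loop does not work and numlist stays empty.
When I download the text file and read it as a local file, everything works.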
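Use the multiline flag re.M and apply the regex to qq as a whole. You are iterating over a string, so you get it character by character, not line by line, which means you are calling findall on single characters: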
In [18]: re.findall("^X-DSPAM-Confidence: ([0-9.]+)",qq, re.M)
Out[18]: ['0.8475', '0.6178', '0.6961', '0.7565', '0.7626', '0.7556', '0.7002', '0.7615', '0.7601', '0.7605', '0.6959', '0.7606', '0.7559', '0.7605', '0.6932', '0.7558', '0.6526', '0.6948', '0.6528', '0.7002', '0.7554', '0.6956', '0.6959', '0.7556', '0.9846', '0.8509', '0.9907']
What you are doing is equivalent to:
In [13]: s = "foo\nbar"
In [14]: for c in s:
   ....:     stuff = re.findall("^X-DSPAM-Confidence: ([0-9.]+)", c)
   ....:     print(c)
   ....:
f
o
o
b
a
r
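If you would rather keep the shape of the original loop, the smallest change is to split the string into lines before iterating. A minimal sketch, assuming a short inline sample in place of the downloaded qq (the `sample` text, its values, and the helper name `confidences` are made up for illustration):

```python
import re

# Hypothetical stand-in for the decoded download (qq in the question).
sample = (
    "From: someone@example.com\n"
    "X-DSPAM-Confidence: 0.8475\n"
    "Subject: test\n"
    "X-DSPAM-Confidence: 0.6178\n"
)


def confidences(text):
    """Collect the X-DSPAM-Confidence values from text, one line at a time."""
    numlist = []
    # str.splitlines() yields whole lines; iterating the str itself
    # yields single characters, which can never match the pattern.
    for line in text.splitlines():
        stuff = re.findall(r"^X-DSPAM-Confidence: ([0-9.]+)", line)
        if len(stuff) != 1:
            continue
        numlist.append(float(stuff[0]))
    return numlist


print(confidences(sample))
```

This is also why the local-file version worked: iterating over a file object yields lines, while iterating over a str yields characters.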
If you want floats, you can use map:
list(map(float,re.findall("^X-DSPAM-Confidence: ([0-9.]+)",qq, re.M)))
But if you only want the maximum, you can pass a key to max:
In [22]: max(re.findall("^X-DSPAM-Confidence: ([0-9.]+)",qq, re.M),key=float)
Out[22]: '0.9907'
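Note that with key=float the values are compared as numbers, but max still returns the original string, as the quoted output shows. A small sketch of the difference, run on made-up sample text instead of the real file (the `sample` values are assumptions for illustration):

```python
import re

# Hypothetical sample text; the real input would be the decoded download.
sample = (
    "X-DSPAM-Confidence: 0.9\n"
    "X-DSPAM-Confidence: 0.10\n"
    "X-DSPAM-Confidence: 0.85\n"
)

# With re.M, ^ matches at the start of every line, not just the string.
matches = re.findall(r"^X-DSPAM-Confidence: ([0-9.]+)", sample, re.M)

# key=float compares numerically but returns the matched string.
largest_str = max(matches, key=float)

# Converting first returns a float instead.
largest = max(map(float, matches))

print(repr(largest_str), repr(largest))
```

If the result feeds into further arithmetic, converting with float first is probably the more convenient form.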
So you only need three lines:
In [28]: hand=urllib.request.urlopen("http://www.pythonlearn.com/code/mbox-short.txt")
In [29]: qq = hand.read().decode('utf-8')
In [30]: max(re.findall("^X-DSPAM-Confidence: ([0-9.]+)",qq, re.M),key=float)
Out[30]: '0.9907'
If you want to go line by line, iterate over hand directly:
import re
import urllib.request
hand = urllib.request.urlopen("http://www.pythonlearn.com/code/mbox-short.txt")
numlist = []
# iterate over each line like a file object; each line arrives as bytes
for line in hand:
    stuff = re.search("^X-DSPAM-Confidence: ([0-9.]+)", line.decode("utf-8"))
    if stuff:
        numlist.append(float(stuff.group(1)))
print('Maximum:', max(numlist))
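One possible way to package the line-by-line version is sketched below. It is a variant under stated assumptions, not the code from the answer above: the helper names `max_confidence` and `fetch_max_confidence` are made up, the response is closed with a `with` block, and the no-match case returns None because max() raises ValueError on an empty list. The demo feeds in-memory bytes so it runs without network access.

```python
import io
import re
import urllib.request

PATTERN = re.compile(r"^X-DSPAM-Confidence: ([0-9.]+)")


def max_confidence(lines):
    """Return the largest confidence in an iterable of bytes lines, or None."""
    numlist = []
    for raw in lines:
        match = PATTERN.search(raw.decode("utf-8"))
        if match:
            numlist.append(float(match.group(1)))
    return max(numlist) if numlist else None


def fetch_max_confidence(url):
    """Hypothetical wrapper: stream the URL line by line and close it afterwards."""
    with urllib.request.urlopen(url) as hand:
        return max_confidence(hand)


# Offline demo with made-up data standing in for the HTTP response.
fake_response = io.BytesIO(
    b"From: someone@example.com\n"
    b"X-DSPAM-Confidence: 0.5\n"
    b"X-DSPAM-Confidence: 0.75\n"
)
print("Maximum:", max_confidence(fake_response))
```

To run it against the real file, call fetch_max_confidence("http://www.pythonlearn.com/code/mbox-short.txt").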