Returning specific content
I only need the IP addresses. How can I scrape just those? My current code:
import urllib.request
from bs4 import BeautifulSoup

x = urllib.request.urlopen('http://bannedhackersips.blogspot.com/2014_08_04_archive.html')
soup = BeautifulSoup(x, "html.parser")
data = soup.find_all("ul", {"class": "posts"})
for content in data:
    print(content.text)
Output:
[Fail2Ban] SSH: banned 116.10.191.162
[Fail2Ban] SSH: banned 116.10.191.204
[Fail2Ban] SSH: banned 61.174.51.232
[Fail2Ban] SSH: banned 61.174.51.224
[Fail2Ban] SSH: banned 116.10.191.225
[Fail2Ban] SSH: banned 200.162.47.130
[Fail2Ban] SSH: banned 116.10.191.175
[Fail2Ban] SSH: banned 61.174.51.223
[Fail2Ban] SSH: banned 61.174.51.234
[Fail2Ban] SSH: banned 61.174.51.209
[Fail2Ban] SSH: banned 116.10.191.165
[Fail2Ban] SSH: banned 106.240.247.220
You can extract them from the text with a regular expression:
import re

data = soup.find("ul", {"class": "posts"})
# Match four dot-separated runs of digits (a loose IPv4 pattern).
r = re.compile(r"\d+\.\d+\.\d+\.\d+")
print(r.findall(data.text))
['116.10.191.162', '116.10.191.204', '61.174.51.232', '61.174.51.224', '116.10.191.225', '200.162.47.130', '116.10.191.175', '61.174.51.223', '61.174.51.234', '61.174.51.209', '116.10.191.165', '106.240.247.220']
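Note that the loose \d+ pattern would also match strings that are not valid addresses (e.g. 999.999.999.999). If you want stricter results, one option is to validate each match with the standard-library ipaddress module; this is a sketch on top of the answer above, not part of the original code:

import ipaddress
import re

r = re.compile(r"\d+\.\d+\.\d+\.\d+")

def valid_ips(text):
    # Keep only matches that parse as real IPv4 addresses;
    # ipaddress.IPv4Address raises ValueError for out-of-range octets.
    ips = []
    for match in r.findall(text):
        try:
            ipaddress.IPv4Address(match)
            ips.append(match)
        except ValueError:
            continue
    return ips

print(valid_ips(data.text))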
Or, since the pattern repeats, you can use splitlines() to break the text into lines and rsplit once from the end of each line to pull out the IP:
data = soup.find("ul", {"class": "posts"})
# Each non-empty line ends with the IP, so take the last whitespace-separated field.
ips = [line.rsplit(None, 1)[1] for line in data.text.splitlines() if line]
print(ips)
['116.10.191.162', '116.10.191.204', '61.174.51.232', '61.174.51.224', '116.10.191.225', '200.162.47.130', '116.10.191.175', '61.174.51.223', '61.174.51.234', '61.174.51.209', '116.10.191.165', '106.240.247.220']
There is only one posts class on the page, so find is enough; when you loop over find_all you are really just iterating over a single-element list.
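Putting that together, a trimmed version of the script could look like this (a sketch only, using the same URL, class, and regex as above):

import re
import urllib.request
from bs4 import BeautifulSoup

url = 'http://bannedhackersips.blogspot.com/2014_08_04_archive.html'
soup = BeautifulSoup(urllib.request.urlopen(url), "html.parser")

# Only one <ul class="posts"> exists on the page, so find() is sufficient.
posts = soup.find("ul", {"class": "posts"})
print(re.findall(r"\d+\.\d+\.\d+\.\d+", posts.text))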