I am trying to merge 2 arrays into pairs of the form (ip, port). How can I arrange them the way I want?
Title says it. I'll point out that this project parses IPs, ports, and their type (https or not) from a free proxy site, then tests them on Linux to see whether they work. It saves those as tuples and writes them to a csv.
import requests
import lxml
from bs4 import BeautifulSoup
import csv

names = []
url = 'https://free-proxy-list.net/'
page = requests.get(url)
soup = BeautifulSoup(page.content, features='lxml')
headers = soup.find_all('th')
headers_refined = []
headers_refined.append(headers[0])
headers_refined.append(headers[1])
headers_refined.append(headers[6])
ips = soup.find_all('td')
ips = ips[::8]
ports = soup.find_all('td')
ports = ports[1::8]
element_index = 0
for i in ips:
    ips[element_index] = str(ips[element_index])
    element_index += 1
element_index = 0
for i in headers_refined:
    headers_refined[element_index] = str(headers_refined[element_index])
    element_index += 1
element_index = 0
for i in ports:
    ports[element_index] = str(ports[element_index])
    element_index += 1
ips = ' '.join(ips).replace('<td>', '').split()
ips = ' '.join(ips).replace('</td>', '').split()
ips = ips[:-43:]
headers_refined = ' '.join(headers_refined).replace('<th>', '').split()
headers_refined = ' '.join(headers_refined).replace('</th>', '').split()
headers_refined = ' '.join(headers_refined).replace('<th class="hx">', '').split()
ports = ' '.join(ports).replace('<td>', '').split()
ports = ' '.join(ports).replace('</td>', '').split()
while len(ports) > len(ips):
    ports = ports[:-1:]
prev_len_ips = len(ips)
index = 0
for i in range(prev_len_ips):
    ips.insert(i + 1, ports[i])
# print(headers_refined)
# print(ips)
# print(ports)
print(prev_len_ips)
print(len(ports))
print(ips)
ips = [*zip(ips[::2])]
with open('ips.csv', '+w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(ips)
The code above prints out the list in the following order:
['IP','port','port','port','port',...]
until it runs out of available ports. After that, it prints the remaining IPs in the list.
P.S. I'd gladly take any other suggestions on improving and optimizing my code to make it look better. Thanks in advance!
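As an aside on why the output looks that way: ips.insert(i + 1, ports[i]) does not account for the list growing on every insertion, so all the ports pile up right after the first IP, and zip(ips[::2]) then yields 1-tuples because zip is handed a single iterable. The interleaving step can be skipped entirely. A minimal sketch, assuming ips and ports are the two cleaned parallel lists as they stand just before the insert loop:

import csv

# zip pairs elements by position and stops at the shorter list,
# so trailing IPs or ports without a partner are dropped automatically
pairs = list(zip(ips, ports))  # [('1.2.3.4', '8080'), ...]

with open('ips.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(pairs)  # one "ip,port" row per proxy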
There is a much simpler way to get what you want out of that page. Since you are already using lxml as the parser, it is all you need:
from urllib.request import urlopen, Request
from lxml import etree

# free-proxy-list.net doesn't like Python announcing itself, use at your own risk
req = Request(
    'https://free-proxy-list.net/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

# reading the contents of the page, getting the part you need
with urlopen(req) as f:
    root = etree.parse(f, parser=etree.HTMLParser())

# get the proxies from the only textarea on the page, skip the description and timestamp
proxies = root.xpath('*//textarea/text()')[0].split('\n')[3:]
# the format you want
proxies = [tuple(proxy.split(':')) for proxy in proxies]
print(proxies)
No external dependencies beyond lxml (no bs4 or requests), and only a few lines of code.
Result:
[('64.17.30.238', '63141'), ('62.33.210.34', '58918'), ... ]
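And if you still want the csv the original script produced, those tuples drop straight into csv.writer. A short sketch, guarding against a possible empty trailing line from the textarea split, which would otherwise come through as a 1-tuple:

import csv

# keep only well-formed (ip, port) entries; an empty trailing line
# from the split would otherwise show up as ('',)
rows = [p for p in proxies if len(p) == 2]

with open('ips.csv', 'w', newline='') as csv_file:
    csv.writer(csv_file).writerows(rows)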