How do I hierarchically sort URLs in Python?
Given an initial list of URLs crawled from a site:
https://somesite.com/
https://somesite.com/advertise
https://somesite.com/articles
https://somesite.com/articles/read
https://somesite.com/articles/read/1154
https://somesite.com/articles/read/1155
https://somesite.com/articles/read/1156
https://somesite.com/articles/read/1157
https://somesite.com/articles/read/1158
https://somesite.com/blogs
I am trying to turn the list into a tree hierarchy organized by tab indentation:
https://somesite.com
    /advertise
    /articles
        /read
            /1154
            /1155
            /1156
            /1157
            /1158
    /blogs
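I have tried using lists, tuples, and dictionaries. So far I have come up with two flawed ways of outputting the content.
Method 1 drops elements that have the same name and position in the hierarchy: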
Input:
https://somesite.com
https://somesite.com/missions
https://somesite.com/missions/playit
https://somesite.com/missions/playit/extbasic
https://somesite.com/missions/playit/extbasic/0
https://somesite.com/missions/playit/stego
https://somesite.com/missions/playit/stego/0
Output:
https://somesite.com/
    /missions
        /playit
            /extbasic
                /0
            /stego
----------------^ Missing expected output "/0"
Method 2 does not drop any elements, but it prints redundant content:
Input:
https://somesite.com
https://somesite.com/missions
https://somesite.com/missions/playit
https://somesite.com/missions/playit/extbasic
https://somesite.com/missions/playit/extbasic/0
https://somesite.com/missions/playit/stego
https://somesite.com/missions/playit/stego/0
Output:
https://somesite.com/
    /missions
        /playit
            /extbasic
                /0
    /missions <- Redundant content
        /playit <- Redundant content
            /stego
                /0
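I am not sure how to do this properly, and my googling has only turned up references to urllib, which does not seem to be what I need. Perhaps there is a better approach, but I have not been able to find it.
My code for getting the content into a usable list: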
#!/usr/bin/python3
import re
# Read the original list of URLs from file
with open("sitelist.raw", "r") as f:
    raw_site_list = f.readlines()
# Extract the prefix and domain from the first line
first_line = raw_site_list[0]
prefix, domain = re.match(r"(https?://)([^/\s]*)", first_line).group(1, 2)
# Remove instances of prefix and domain, and trailing newlines, drop any lines that are only a slash
clean_site_list = []
for line in raw_site_list:
    clean_line = line.strip()
    # str.strip(chars) removes a set of characters, not a substring,
    # so slice the prefix and domain off instead
    if clean_line.startswith(prefix + domain):
        clean_line = clean_line[len(prefix + domain):]
    # Skip empty lines and lines that end in a slash
    if clean_line and not clean_line.endswith("/"):
        clean_site_list += [clean_line]
# Split the resulting relative paths into their component parts and filter out empty strings
split_site_list = []
for site in clean_site_list:
    split_site_list += [list(filter(None, site.split("/")))]
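This gives me a list to work with, but I have run out of ideas for how to output it without losing elements or printing redundant ones.
Thanks
EDIT: This is the final working code I put together based on the answer chosen below: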
# Read list of URLs from file, removing trailing newlines and any trailing slashes
with open("sitelist.raw", "r") as f:
    urls = [line.strip().rstrip("/") for line in f]
# Remove duplicate lines (and blank ones), keeping the first occurrence
unique_urls = []
for url in urls:
    if url and url not in unique_urls:
        unique_urls += [url]
# Do the actual work (modified to use unique_urls and use tabs instead of 4x spaces, and to write to file)
base = unique_urls[0]
tlen = len(base.split('/'))
final_urls = []
for url in unique_urls[1:]:
    t = url.split('/')
    # Depth is the number of components beyond the base. Stepping a counter
    # by +/-1 mis-indents a jump of several levels, e.g. from
    # /articles/read/1158 back up to /blogs.
    tabdepth = len(t) - tlen
    pad = '\t' * tabdepth
    final_urls += [f'{pad}/{t[-1]}']
with open("sitelist.new", "wt") as f:
    f.write(base + "\n")
    for url in final_urls:
        f.write(url + "\n")
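The question mentions that the search only turned up urllib. As a hedged aside, the standard library's urllib.parse.urlsplit can take over the prefix and domain extraction that the question does with a regular expression. The sketch below is only an illustration under that assumption; split_url is a hypothetical helper name, not something from the original posts.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (base, segments) for an absolute URL (hypothetical helper)."""
    parts = urlsplit(url.strip())
    # Scheme plus host, e.g. "https://somesite.com"
    base = f"{parts.scheme}://{parts.netloc}"
    # Empty strings come from leading, trailing, or doubled slashes
    segments = [segment for segment in parts.path.split("/") if segment]
    return base, segments


if __name__ == "__main__":
    print(split_url("https://somesite.com/articles/read/1154\n"))
```

The number of segments is the depth of the URL below its base, which is the quantity the indentation in the answers is built on.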
This works for your sample data:
urls = ['https://somesite.com',
        'https://somesite.com/missions',
        'https://somesite.com/missions/playit',
        'https://somesite.com/missions/playit/extbasic',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/stego',
        'https://somesite.com/missions/playit/stego/0']
base = urls[0]
print(base)
tlen = len(base.split('/'))
for url in urls[1:]:
    t = url.split('/')
    # Depth is the number of components beyond the base, so moving up
    # several levels at once is indented correctly as well
    tabdepth = len(t) - tlen
    pad = '    ' * tabdepth
    print(f'{pad}/{t[-1]}')
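The loop above relies on the list already being in hierarchical order. A minimal sketch of the same idea for an unordered list follows, assuming all URLs share one domain and every intermediate level is present in the list. The name render_hierarchy and its indent parameter are hypothetical. Sorting by "/"-separated components places each parent directly before its children, and the depth is taken from the component count. Segments are compared as strings, so "10" would sort before "9".

```python
def render_hierarchy(urls, indent="    "):
    """Return indented lines for URLs that share a single base (hypothetical helper)."""
    # Split into components and drop duplicates; tuples sort parent-first
    rows = sorted({tuple(url.strip().rstrip("/").split("/")) for url in urls})
    base = rows[0]  # the shortest common prefix sorts first
    lines = ["/".join(base)]
    for row in rows[1:]:
        # Depth is the number of components beyond the base
        depth = len(row) - len(base)
        lines.append(f"{indent * depth}/{row[-1]}")
    return lines


if __name__ == "__main__":
    sample = [
        "https://somesite.com/blogs",
        "https://somesite.com/articles/read/1154",
        "https://somesite.com/",
        "https://somesite.com/articles/read",
        "https://somesite.com/articles",
        "https://somesite.com/advertise",
    ]
    print("\n".join(render_hierarchy(sample)))
```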
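This code will help you with your task. I agree that it is a bit large and may contain some redundant code and checks, but it builds a dictionary containing the URL hierarchy, which you can use however you like: print it or store it.
In addition, this code also handles URLs from different domains and creates a separate tree for each of them (see code and output).
EDIT: This also handles redundant URLs.
Code: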
def process_urls(urls: list):
    tree = {}
    for url in urls:
        url_components = url.split("/")
        # First three components will be the protocol
        # an empty entry
        # and the base domain
        base_domain = url_components[:3]
        base_domain = base_domain[0] + "//" + "".join(base_domain[1:])
        # Add base domain to tree if not there.
        if base_domain not in tree:
            tree[base_domain] = {}
        # Drop empty components left by a trailing slash
        structure = [part for part in url_components[3:] if part]
        for i in range(len(structure)):
            # Walk down to the parent of element i; the elements before it
            # were added in earlier iterations of this loop
            base = tree[base_domain]
            for j in range(i):
                base = base["/"+structure[j]]
            # Add element i if it is not there yet
            if "/"+structure[i] not in base:
                base["/"+structure[i]] = {}
    return tree

def print_tree(tree: dict, depth=0):
    for key in tree.keys():
        print("\t"*depth+key)
        # redundant checks
        if isinstance(tree[key], dict):
            # if dictionary is empty then do nothing
            # else call this function recursively
            # increase depth by 1
            if tree[key]:
                print_tree(tree[key], depth+1)

if __name__ == "__main__":
    urls = [
        'https://somesite.com',
        'https://somesite.com/missions',
        'https://somesite.com/missions/playit',
        'https://somesite.com/missions/playit/extbasic',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/stego',
        'https://somesite.com/missions/playit/stego/0',
        'https://somesite2.com/missions/playit',
        'https://somesite2.com/missions/playit/extbasic',
        'https://somesite2.com/missions/playit/extbasic/0',
        'https://somesite2.com/missions/playit/stego',
        'https://somesite2.com/missions/playit/stego/0'
    ]
    tree = process_urls(urls)
    print_tree(tree)
Output:
https://somesite.com
    /missions
        /playit
            /extbasic
                /0
            /stego
                /0
https://somesite2.com
    /missions
        /playit
            /extbasic
                /0
            /stego
                /0
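The nested-dictionary idea in this answer can also be written more compactly with dict.setdefault, which returns the existing child node or creates an empty one. The following is only a sketch of that variant, not the answerer's code. The names build_tree and render_tree are hypothetical, and render_tree returns the lines instead of printing them so that the result can be stored or tested.

```python
def build_tree(urls):
    """Build a nested dict: {base: {"/segment": {...}}} (hypothetical helper)."""
    tree = {}
    for url in urls:
        scheme, _, rest = url.strip().partition("://")
        domain, *segments = rest.split("/")
        # setdefault returns the existing node or inserts a new empty dict
        node = tree.setdefault(f"{scheme}://{domain}", {})
        for segment in segments:
            if segment:  # skip empty parts left by a trailing slash
                node = node.setdefault("/" + segment, {})
    return tree


def render_tree(tree, depth=0):
    """Return one tab-indented line per node, parents before children."""
    lines = []
    for key, subtree in tree.items():
        lines.append("\t" * depth + key)
        lines.extend(render_tree(subtree, depth + 1))
    return lines


if __name__ == "__main__":
    sample = [
        "https://somesite.com",
        "https://somesite.com/missions/playit/extbasic/0",
        "https://somesite.com/missions/playit/stego/0",
    ]
    print("\n".join(render_tree(build_tree(sample))))
```

Because every level of a path is created on the way down, duplicate URLs and missing intermediate URLs (for example a list that contains /missions/playit but not /missions) still produce a complete tree.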