Make URLs unique based on their domain name
I have a list of URLs in a file named urls.list:
https://target.com/?first=one
https://target.com/something/?first=one
http://target.com/dir/?first=summer
https://fake.com/?first=spring
https://example.com/about/?third=three
https://example.com/?third=three
I want to make them unique based on their domain, for example https://target.com, meaning each protocol-plus-domain combination is printed only once and any later URL with the same combination is skipped.
So the result would be:
https://target.com/?first=one
http://target.com/dir/?first=summer
https://fake.com/?first=spring
https://example.com/about/?third=three
Here is what I tried:
cat urls.list | cut -d"/" -f1-3 | awk '!a[$0]++' >> host_unique.del
for urls in $(cat urls.list); do
    for hosts in $(cat host_unique.del); do
        if [[ $hosts == *"$urls"* ]]; then
            echo "$hosts"
        fi
    done
done
With your shown samples, please try the following.
awk 'match($0,/https?:\/\/[^/]*/){val=substr($0,RSTART,RLENGTH)} !arr[val]++' Input_file
Explanation: adding a detailed explanation for the above.
awk '                                ##Starting awk program from here.
match($0,/https?:\/\/[^/]*/){        ##Using match to match http or https followed by :// and the domain name.
  val=substr($0,RSTART,RLENGTH)      ##Creating val which holds the matched string value here.
}
!arr[val]++                          ##Checking condition: if val is not already present in arr then print the current line.
' Input_file                         ##Mentioning Input_file name here.
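As a quick check of the idea above, running the same one-liner against the sample list from the question (the file name urls.list is assumed here in place of Input_file) should keep only the first URL seen for each protocol-plus-domain key:

awk 'match($0,/https?:\/\/[^/]*/){val=substr($0,RSTART,RLENGTH)} !arr[val]++' urls.list
# expected output, matching the result shown in the question:
# https://target.com/?first=one
# http://target.com/dir/?first=summer
# https://fake.com/?first=spring
# https://example.com/about/?third=three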
This awk might do what you wanted.
awk -F'/' '!seen[$1,$3]++' urls.list
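To see why $1 and $3 work as a per-protocol, per-domain key, here is a quick sketch of how -F'/' splits one of the sample URLs:

# With -F'/', "https://target.com/?first=one" splits into:
#   $1 = "https:"     (scheme plus colon)
#   $2 = ""           (empty field between the two slashes)
#   $3 = "target.com" (host)
echo 'https://target.com/?first=one' | awk -F'/' '{print $1, $3}'
# prints: https: target.com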
A bash alternative would be very slow on a large set of data/files, but here it is. It uses mapfile (aka readarray), which is a bash 4+ feature, an associative array, plus some more bash features.
#!/usr/bin/env bash

declare -A uniq                                  ##Associative array tracking scheme/host pairs already seen.
mapfile -t urls < urls.list                      ##Read all lines of urls.list into the urls array.

for uniq_url in "${urls[@]}"; do
    IFS='/' read -ra url <<< "$uniq_url"         ##Split on "/": url[0] is the scheme, url[2] is the host.
    if ((!uniq["${url[0]}","${url[2]}"]++)); then ##Print only the first URL seen for each scheme/host pair.
        printf '%s\n' "$uniq_url"
    fi
done
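Saved as a script (the file name uniq_urls.sh below is only an assumption) and run against the sample list, it should print the same four URLs as the awk approaches:

bash uniq_urls.sh
# https://target.com/?first=one
# http://target.com/dir/?first=summer
# https://fake.com/?first=spring
# https://example.com/about/?third=three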