为 GNU parallels 优化我的脚本代码
Optimising my script code for GNU parallels
我有一个脚本可以成功查询 API,但速度很慢。获取所有资源大约需要 16 个小时。我研究了如何优化它,我认为使用 GNU parallels(通过 Brew 安装在 macos 上,版本 20180522)可以解决问题。但即使使用 90 个作业(API 端点最多授权 100 个连接),我的脚本也不会更快。我不知道为什么。
我这样调用我的脚本:
bash script.sh | parallel -j90
脚本如下:
#!bin/bash
# This script downloads the list of French MPs who contributed to a specific amendment.
# The script is initialised with a file containing a list of API URLs, each pointing to a resource describing an amendment
# The main function loops over 3 actions:
# 1. assign to $sign the API url that points to the list of amendment authors
# 2. run the functions auteur and cosignataires and save them in their respective variables
# 3. merge the variable contents and append them as a new line into a csv file
main(){
local file=""
local line
local sign
local auteur_clean
local cosign_clean
while read line
do
sign="${line}/signataires"
auteur_clean=$(auteur $sign)
cosign_clean=$(cosignataires $sign)
echo "${auteur_clean}","${cosign_clean}" >> signataires_15.csv
done < "${file}"
}
# The auteur function takes the $sign variable as an input and
# 1. filters the json returned by the API to get only the author's ID
# 2.use the ID stored in $auteur to query the full author resource and capture the key info, which is then assigned to $auteur_nom
# 3. echo a cleaned version of the info stored in $auteur_nom
auteur(){
local url=""
local auteur
local auteur_nom
auteur=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="auteur") | .id') \
&& auteur_nom=$(curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" \
| jq -r --arg url "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" '$url, .amendement.id, .acteur.id, (.acteur.prenom + " " + .acteur.nom)') \
&& echo "${auteur_nom}" | tr '\n' ',' | sed 's/,$//'
}
# The cosignataires function takes the $sign variable as an input and
# 1. filter the json returned by the API to produce a space separated list of co-authors
# 2. iterates over list of coauthors to get their name and surname, and assign the resulting list to $cosign_nom
# 3. echo a semi-colon separated list of the co-author names
cosignataires(){
local url=""
local cosign
local cosign_nom
local i
cosign=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="cosignataire") | .id' | tr '\n' ' ') \
&& cosign_nom=$(for i in ${cosign}; do curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${i}" | jq -r '(.acteur.prenom + " " + .acteur.nom)'; done) \
&& echo "${cosign_nom}" | tr '\n' ';' | sed 's/,$//'
}
main "url_amendements_15.txt"
url_amendements_15.txt
的内容如下所示:
https://www.parlapi.fr/rest/an/amendements/AMANR5L15SEA717460BTC0174P0D1N7
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N90
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N134
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N187
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N161
您的脚本循环遍历 URL 的列表并按顺序查询它们。您需要将其分解,以便每个 API 查询单独完成,这样 parallel
将具有可以并行执行的命令。
更改脚本,使其只需要一个 URL。摆脱主 while
循环。
main() {
local url=
local sign
local auteur_clean
local cosign_clean
sign=$url/signataires
auteur_clean=$(auteur "$sign")
cosign_clean=$(cosignataires "$sign")
echo "$auteur_clean,$cosign_clean" >> signataires_15.csv
}
然后将url_amendements_15.txt
传给parallel
。给它一个可以并行处理的URL的列表。
parallel -j90 script.sh < url_amendements_15.txt
我有一个脚本可以成功查询 API,但速度很慢。获取所有资源大约需要 16 个小时。我研究了如何优化它,我认为使用 GNU parallels(通过 Brew 安装在 macos 上,版本 20180522)可以解决问题。但即使使用 90 个作业(API 端点最多授权 100 个连接),我的脚本也不会更快。我不知道为什么。
我这样调用我的脚本:
bash script.sh | parallel -j90
脚本如下:
#!bin/bash
# This script downloads the list of French MPs who contributed to a specific amendment.
# The script is initialised with a file containing a list of API URLs, each pointing to a resource describing an amendment
# The main function loops over 3 actions:
# 1. assign to $sign the API url that points to the list of amendment authors
# 2. run the functions auteur and cosignataires and save them in their respective variables
# 3. merge the variable contents and append them as a new line into a csv file
main(){
local file=""
local line
local sign
local auteur_clean
local cosign_clean
while read line
do
sign="${line}/signataires"
auteur_clean=$(auteur $sign)
cosign_clean=$(cosignataires $sign)
echo "${auteur_clean}","${cosign_clean}" >> signataires_15.csv
done < "${file}"
}
# The auteur function takes the $sign variable as an input and
# 1. filters the json returned by the API to get only the author's ID
# 2.use the ID stored in $auteur to query the full author resource and capture the key info, which is then assigned to $auteur_nom
# 3. echo a cleaned version of the info stored in $auteur_nom
auteur(){
local url=""
local auteur
local auteur_nom
auteur=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="auteur") | .id') \
&& auteur_nom=$(curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" \
| jq -r --arg url "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" '$url, .amendement.id, .acteur.id, (.acteur.prenom + " " + .acteur.nom)') \
&& echo "${auteur_nom}" | tr '\n' ',' | sed 's/,$//'
}
# The cosignataires function takes the $sign variable as an input and
# 1. filter the json returned by the API to produce a space separated list of co-authors
# 2. iterates over list of coauthors to get their name and surname, and assign the resulting list to $cosign_nom
# 3. echo a semi-colon separated list of the co-author names
cosignataires(){
local url=""
local cosign
local cosign_nom
local i
cosign=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="cosignataire") | .id' | tr '\n' ' ') \
&& cosign_nom=$(for i in ${cosign}; do curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${i}" | jq -r '(.acteur.prenom + " " + .acteur.nom)'; done) \
&& echo "${cosign_nom}" | tr '\n' ';' | sed 's/,$//'
}
main "url_amendements_15.txt"
url_amendements_15.txt
的内容如下所示:
https://www.parlapi.fr/rest/an/amendements/AMANR5L15SEA717460BTC0174P0D1N7
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N90
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N134
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N187
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N161
您的脚本循环遍历 URL 的列表并按顺序查询它们。您需要将其分解,以便每个 API 查询单独完成,这样 parallel
将具有可以并行执行的命令。
更改脚本,使其只需要一个 URL。摆脱主 while
循环。
main() {
local url=
local sign
local auteur_clean
local cosign_clean
sign=$url/signataires
auteur_clean=$(auteur "$sign")
cosign_clean=$(cosignataires "$sign")
echo "$auteur_clean,$cosign_clean" >> signataires_15.csv
}
然后将url_amendements_15.txt
传给parallel
。给它一个可以并行处理的URL的列表。
parallel -j90 script.sh < url_amendements_15.txt