Optimising my script code for GNU parallel

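I have a script that successfully queries an API, but it is slow: fetching all the resources takes about 16 hours. I looked into how to optimise it and thought that GNU parallel (installed on macOS via Homebrew, version 20180522) would do the trick. But even with 90 jobs (the API endpoint allows up to 100 connections), my script is no faster. I don't know why.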

I call my script like this:

bash script.sh | parallel -j90

The script is as follows:

#!/bin/bash

# This script downloads the list of French MPs who contributed to a specific amendment.
# The script is initialised with a file containing a list of API URLs, each pointing to a resource describing an amendment


# The main function loops over 3 actions:
# 1. assign to $sign the API url that points to the list of amendment authors
# 2. run the functions auteur and cosignataires and save them in their respective variables
# 3. merge the variable contents and append them as a new line into a csv file 
main(){
local file="$1"
local line
local sign
local auteur_clean
local cosign_clean

while read line
    do
        sign="${line}/signataires"
        auteur_clean=$(auteur $sign)
        cosign_clean=$(cosignataires $sign)
        echo "${auteur_clean}","${cosign_clean}" >> signataires_15.csv
done < "${file}"
}

# The auteur function takes the $sign variable as an input and 
# 1. filters the json returned by the API to get only the author's ID
# 2. uses the ID stored in $auteur to query the full author resource and capture the key info, which is then assigned to $auteur_nom
# 3. echoes a cleaned version of the info stored in $auteur_nom
auteur(){
local url="$1"
local auteur
local auteur_nom

auteur=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="auteur") | .id') \
&& auteur_nom=$(curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" \
| jq -r --arg url "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" '$url, .amendement.id, .acteur.id, (.acteur.prenom + " " + .acteur.nom)') \
&& echo "${auteur_nom}" | tr '\n' ',' | sed 's/,$//'
}

# The cosignataires function takes the $sign variable as an input and 
# 1. filter the json returned by the API to produce a space separated list of co-authors
# 2. iterates over list of coauthors to get their name and surname, and assign the resulting list to $cosign_nom
# 3. echo a semi-colon separated list of the co-author names
cosignataires(){
local url="$1"
local cosign
local cosign_nom
local i

cosign=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="cosignataire") | .id' | tr '\n' ' ') \
&& cosign_nom=$(for i in ${cosign}; do curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${i}" | jq -r '(.acteur.prenom + " " + .acteur.nom)'; done) \
&& echo "${cosign_nom}" | tr '\n' ';' | sed 's/;$//'
}

main "url_amendements_15.txt"

The contents of url_amendements_15.txt look like this:

https://www.parlapi.fr/rest/an/amendements/AMANR5L15SEA717460BTC0174P0D1N7
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N90
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N134
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N187
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N161

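Your script loops over the list of URLs and queries them one after another. Piping it into parallel does not change that: the script still runs once, sequentially, and parallel only receives whatever it prints to standard output, which is nothing, because the results are appended to the CSV file. You need to break the work up so that each API query is done separately; then parallel has commands it can run in parallel.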

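Change the script so that it takes a single URL. Get rid of the main while loop, and call main with the script's first argument instead of the file name.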

main() {
    local url=$1
    local sign
    local auteur_clean
    local cosign_clean

    sign=$url/signataires
    auteur_clean=$(auteur "$sign")
    cosign_clean=$(cosignataires "$sign")
    echo "$auteur_clean,$cosign_clean" >> signataires_15.csv
}
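
As a rough sketch of how the pieces fit together, the single-URL script could end like this. The auteur and cosignataires functions here are placeholders so the sketch runs offline; the real script would keep the curl/jq versions from the question. The guard around the final call is an assumption added for illustration, not part of the original script.

```shell
#!/bin/bash
# Sketch of script.sh reduced to one URL per run.
# Placeholder helpers (assumption): the real script keeps the
# curl/jq implementations of auteur and cosignataires.
auteur() { printf 'auteur(%s)' "$1"; }
cosignataires() { printf 'cosignataires(%s)' "$1"; }

main() {
    local url=$1
    local sign
    local auteur_clean
    local cosign_clean

    sign=$url/signataires
    auteur_clean=$(auteur "$sign")
    cosign_clean=$(cosignataires "$sign")
    # Each run appends exactly one line to the shared CSV file.
    echo "$auteur_clean,$cosign_clean" >> signataires_15.csv
}

# parallel passes one URL as the first argument of each run,
# so main receives "$1" rather than the name of the URL file.
if [ "$#" -ge 1 ]; then
    main "$1"
fi
```

Run by hand with one of the URLs from url_amendements_15.txt, this should append a single line to signataires_15.csv, which is an easy way to check the script before handing it to parallel.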

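Then feed url_amendements_15.txt to parallel, which gives it a list of URLs it can process in parallel. parallel appends each input line to the command, so every URL becomes the first argument of its own script.sh run (script.sh must be executable and reachable, for example as ./script.sh).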

parallel -j90 script.sh < url_amendements_15.txt
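
To see what parallel will actually run before spending API calls, a dry run can help. The wrapper below is a hypothetical helper, not part of the original answer; it assumes GNU parallel is installed and that script.sh sits in the current directory, and it calls the script through bash so the executable bit does not matter.

```shell
#!/bin/bash
# Hypothetical helper: preview, then run, one script.sh job per URL.
# Assumes GNU parallel is installed and script.sh is in the current directory.
run_amendements() {
    local urls=${1:-url_amendements_15.txt}

    # --dry-run prints the commands parallel would run instead of running them.
    parallel --dry-run -j90 bash script.sh < "$urls" | head -n 2

    # Real run: each input line is appended to the command as its last argument,
    # and up to 90 jobs run at the same time.
    parallel -j90 bash script.sh < "$urls"
}
```

Calling `run_amendements` with no argument uses url_amendements_15.txt. The preview should show lines of the form `bash script.sh https://www.parlapi.fr/rest/an/amendements/...`, one command per URL, which is the unit of work parallel can spread across jobs.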