Bash sed command issue
I am trying to further parse an output file that I generated using an additional grep command. The code I am currently using is:
#!/bin/bash
# fetches the links of the movie's imdb pages for a given actor
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ]; then
code="nm0000122"
else
code="nm0000050"
fi
curl "https://www.imdb.com/name/$code/#actor" |
grep -Eo 'href="/title/[^"]*' |
sed 's#^.*href=\"/#https://www.imdb.com/#g' |
sort -u > imdb_links.txt
# parses each of the links in the link text file and gets the details for
# each of the movies. This is followed by the cleaning process
for i in $(cat imdb_links.txt)
do
curl $i |
html2text |
sed -n '/Sign_In/,$p'|
sed -n '/YOUR RATING/q;p' |
head -n-1 |
tail -n+2
done > imdb_all.txt
A sample of the generated output is:
EN
⁰
* Fully supported
* English (United States)
* Partially_supported
* Français (Canada)
* Français (France)
* Deutsch (Deutschland)
* हिंदी (भारत)
* Italiano (Italia)
* Português (Brasil)
* Español (España)
* Español (México)
****** Duck Soup ******
* 19331933
* Not_RatedNot Rated
* 1h 9m
IMDb RATING
7.8/10
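How can I change the code to further parse the output so that I get only the data from the movie title up to the IMDb rating (in this case, from the line containing the title 'Duck Soup' to the end)?
Using sed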
$ sed -n '/\*[^[:alpha:] ]*\*/,$ p' input_file
****** Duck Soup ******
* 19331933
* Not_RatedNot Rated
* 1h 9m
IMDb RATING
7.8/10
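To see what the address range is doing, here is a minimal, self-contained sketch that runs the same sed command on a shortened copy of the sample output. The variable names `sample` and `trimmed` are illustrative only. The pattern matches the first line that contains two asterisks with only non-letter, non-space characters between them, which the single-asterisk bullet lines never satisfy; `,$` then extends the range to the last line.

```shell
#!/bin/bash
# Shortened copy of the html2text output shown in the question.
sample='* Español (España)
* Español (México)
****** Duck Soup ******
* 19331933
* 1h 9m
IMDb RATING
7.8/10'

# -n suppresses default printing; "p" prints only the lines inside the range
# that starts at the first regex match and ends at the last line ($).
trimmed=$(printf '%s\n' "$sample" | sed -n '/\*[^[:alpha:] ]*\*/,$ p')

printf '%s\n' "$trimmed"
```

Note that this relies on the title being the first line with consecutive asterisks; if the page layout changes, the pattern would need adjusting.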
The code is as follows:
#!/bin/bash
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ]; then
code="nm0000122"
else
code="nm0000050"
fi
rm -f imdb_links.txt
curl "https://www.imdb.com/name/$code/#actor" |
grep -Eo 'href="/title/[^"]*' |
sed 's#^href="#https://www.imdb.com#g' |
sort -u |
while read link; do
# uncomment the next line to save links into file:
#echo "$link" >>imdb_links.txt
curl "$link" |
html2text -utf8 |
sed -n '/Sign_In/,/YOUR RATING/ p' |
sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
done >imdb_all.txt
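The two text-processing stages of this script can be tried offline. The sketch below is an illustration rather than part of the answer: the function names `extract_links` and `trim_page` are hypothetical, and the HTML fragment and page text are made-up stand-ins for what curl and html2text would return.

```shell
#!/bin/bash
# Stage 1: turn href="/title/..." attributes into absolute, de-duplicated URLs.
extract_links() {
  grep -Eo 'href="/title/[^"]*' |
    sed 's#^href="#https://www.imdb.com#g' |
    sort -u
}

# Stage 2: keep the Sign_In..YOUR RATING block, delete its last line,
# then print from the "****** title ******" line onward.
trim_page() {
  sed -n '/Sign_In/,/YOUR RATING/ p' |
    sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
}

# Stand-in for the actor page: one title link twice, one non-title link.
html='<a href="/title/tt0061523/?ref_=nm_flmg_act_1">one</a>
<a href="/title/tt0061523/?ref_=nm_flmg_act_1">same link again</a>
<a href="/name/nm0000122/">not a title link</a>'

# Stand-in for the html2text output of a movie page.
page='Menu
Sign_In
EN
* Fully supported
****** Duck Soup ******
* 19331933
IMDb RATING
7.8/10
YOUR RATING
Rate'

links=$(printf '%s\n' "$html" | extract_links)
details=$(printf '%s\n' "$page" | trim_page)
printf '%s\n' "$links" "$details"
```

The second sed sees only the block selected by the first one, so its `$d` removes the `YOUR RATING` line and not the real end of the page.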
Please (!) have a look at the following links to understand why parsing HTML with sed is a very bad idea:
- RegEx match open tags except XHTML self-contained tags
- Using regular expressions to parse HTML: why not?
- Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
What you are trying to do can be done with the HTML/XML/JSON parser xidel, and with just 1 call!
In this example I'll use the IMDB page of Charlie Chaplin as the source.
Extracting all 94 "Actor" IMDB movie URLs:
$ xidel -s "https://www.imdb.com/name/nm0000122" -e '
//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href
'
/title/tt0061523/?ref_=nm_flmg_act_1
/title/tt0050598/?ref_=nm_flmg_act_2
/title/tt0044837/?ref_=nm_flmg_act_3
[...]
/title/tt0004288/?ref_=nm_flmg_act_94
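There is no need to save these to a text file. Just use -f (--follow) instead of -e, and xidel will open all of them.
For an individual movie URL you could parse the HTML to get the text nodes you want...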
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
//h1,
//div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[1]/span,
//div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[3],
(//div[@class="sc-7ab21ed2-2 kYEdvH"])[1]
'
A Countess from Hong Kong
1967
2h
6.0/10
...but with those class names I'd say that's a rather fragile endeavor. Instead, I'd suggest parsing the JSON at the top of the HTML source, inside the <script> node:
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
parse-json(//script[@type="application/ld+json"])/(
name,
datePublished,
duration,
aggregateRating/ratingValue
)
'
A Countess from Hong Kong
1967-03-15
PT2H
6
...or, to get output similar to the above:
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
parse-json(//script[@type="application/ld+json"])/(
name,
year-from-date(date(datePublished)),
substring(lower-case(duration),3),
format-number(aggregateRating/ratingValue,"#.0")||"/10"
)
'
A Countess from Hong Kong
1967
2h
6.0/10
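For readers who obtain the JSON fields some other way, the last two transformations are easy to mirror in plain shell. This is only a sketch: `format_duration` and `format_rating` are hypothetical helper names, and the inputs are the raw values shown in the earlier output.

```shell
#!/bin/bash
# Mirror of substring(lower-case(duration),3): lower-case the ISO 8601
# duration and drop its first two characters ("pt").
format_duration() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | cut -c3-
}

# Mirror of format-number(...,"#.0")||"/10": one decimal place plus "/10".
# The subshell body pins the C locale so "." is always the decimal separator.
format_rating() (
  export LC_ALL=C
  printf '%.1f/10\n' "$1"
)

format_duration "PT2H"
format_rating 6
```

Both helpers read their value from the first argument, so they can be dropped into a `while read` loop over extracted fields.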
All combined:
$ xidel -s "https://www.imdb.com/name/nm0000122" \
-f '//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href' \
-e '
parse-json(//script[@type="application/ld+json"])/(
name,
year-from-date(date(datePublished)),
substring(lower-case(duration),3),
format-number(aggregateRating/ratingValue,"#.0")||"/10"
)
'
A Countess from Hong Kong
1967
2h
6.0/10
A King in New York
1957
1h50m
7.0/10
Limelight
1952
2h17m
8.0/10
[...]
Making a Living
1914
11m
5.5/10