Bash sed command issue

Question

I'm trying to further parse an output file that I generated with an additional grep command. The code I'm currently using is:

#!/bin/bash

# fetches the links of the movie's imdb pages for a given actor

# fullname="USER INPUT"
read -p "Enter fullname: " fullname

if [ "$fullname" = "Charlie Chaplin" ]; then
  code="nm0000122"
else
  code="nm0000050"
fi


curl "https://www.imdb.com/name/$code/#actor" |
  grep -Eo 'href="/title/[^"]*' |
  sed 's#^.*href=\"/#https://www.imdb.com/#g' |
  sort -u > imdb_links.txt

# parses each link in the links text file and gets the details for
# each movie. This is followed by the cleaning process
for i in $(cat imdb_links.txt) 
do 
   curl $i | 
   html2text | 
   sed -n '/Sign_In/,$p'|  
   sed -n '/YOUR RATING/q;p' | 
   head -n-1 | 
   tail -n+2 
done > imdb_all.txt

A sample of the generated output is:

EN
⁰
    * Fully supported
    * English (United States)
    * Partially_supported
    * FranÃ§ais (Canada)
    * FranÃ§ais (France)
    * Deutsch (Deutschland)
    * à¤¹à¤¿à¤‚à¤¦à¥€ (à¤à¤¾à¤°à¤¤)
    * Italiano (Italia)
    * PortuguÃªs (Brasil)
    * EspaÃ±ol (EspaÃ±a)
    * EspaÃ±ol (MÃ©xico)
****** Duck Soup ******
    * 19331933
    * Not_RatedNot Rated
    * 1h 9m
IMDb RATING
7.8/10

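How can I change the code to parse the output further, so that I keep only the data from the movie title to the IMDb rating (in this case, from the line containing the title 'Duck Soup' to the end)?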

Answer 1

Using sed

$ sed -n '/\*[^[:alpha:] ]*\*/,$ p' input_file
****** Duck Soup ******
    * 19331933
    * Not_RatedNot Rated
    * 1h 9m
IMDb RATING
7.8/10
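
To try the range address without hitting the network, here is a minimal sketch that pipes a shortened copy of the sample output through the same sed expression. The function name keep_from_title and the inline sample are assumptions made for illustration only.

```shell
#!/bin/bash

# Hypothetical helper: print from the first line that contains two asterisks
# separated only by characters that are neither letters nor spaces (the
# "****** Title ******" banner) through the end of the input.
keep_from_title() {
  sed -n '/\*[^[:alpha:] ]*\*/,$ p'
}

# Shortened copy of the sample output from the question.
sample='EN
    * Fully supported
    * English (United States)
****** Duck Soup ******
    * 19331933
    * 1h 9m
IMDb RATING
7.8/10'

printf '%s\n' "$sample" | keep_from_title
```

The bullet lines are skipped because each holds a single asterisk, so the pattern first matches on the title banner.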

Answer 2

The code is as follows:

#!/bin/bash

# fullname="USER INPUT"
read -p "Enter fullname: " fullname

if [ "$fullname" = "Charlie Chaplin" ]; then
  code="nm0000122"
else
  code="nm0000050"
fi

rm -f imdb_links.txt

curl "https://www.imdb.com/name/$code/#actor" |
  grep -Eo 'href="/title/[^"]*' |
  sed 's#^href="#https://www.imdb.com#g' |
  sort -u |
while read link; do
   # uncomment the next line to save links into file:
   #echo "$link" >>imdb_links.txt

   curl "$link" |
     html2text -utf8 |
     sed -n '/Sign_In/,/YOUR RATING/ p' |
     sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
done >imdb_all.txt
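
The two sed stages inside the loop can be checked offline. The sketch below is not part of the script above: the helper name extract_block and the made-up page text are assumptions, standing in for what html2text might print for one title page.

```shell
#!/bin/bash

# Hypothetical helper mirroring the two sed stages of the loop body.
extract_block() {
  # Stage 1: keep the range from "Sign_In" through "YOUR RATING".
  sed -n '/Sign_In/,/YOUR RATING/ p' |
  # Stage 2: delete the last line ("YOUR RATING"), and print from the
  # six-asterisk title banner to the end of what remains.
  sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
}

# Made-up stand-in for the html2text output of a single title page.
page='Menu
Sign_In
EN
    * Fully supported
****** Duck Soup ******
    * 19331933
IMDb RATING
7.8/10
YOUR RATING
Rate'

printf '%s\n' "$page" | extract_block
```

With this input only the four lines from the banner to 7.8/10 should remain.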

Answer 3

Please (!) have a look at the following links to see why parsing HTML with sed is a really bad idea:

RegEx match open tags except XHTML self-contained tags
Using regular expressions to parse HTML: why not?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

What you're trying to do can be done with the HTML/XML/JSON parser xidel, and with just 1 call!
In this example I'll use the IMDB page of Charlie Chaplin as the source.

Extracting all 94 "Actor" IMDB movie URLs:

$ xidel -s "https://www.imdb.com/name/nm0000122" -e '
  //div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href
'
/title/tt0061523/?ref_=nm_flmg_act_1
/title/tt0050598/?ref_=nm_flmg_act_2
/title/tt0044837/?ref_=nm_flmg_act_3
[...]
/title/tt0004288/?ref_=nm_flmg_act_94

There's no need to save these to a text file. Just use -f (--follow) instead of -e and xidel will open all of them.

For an individual movie URL you could parse the HTML to get the text nodes you want...

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  //h1,
  //div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[1]/span,
  //div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[3],
  (//div[@class="sc-7ab21ed2-2 kYEdvH"])[1]
'
A Countess from Hong Kong
1967
2h
6.0/10

...but with those class names I'd say this is a rather fragile endeavor. Instead, I'd recommend parsing the JSON at the top of the HTML source, inside a <script> node:

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  parse-json(//script[@type="application/ld+json"])/(
    name,
    datePublished,
    duration,
    aggregateRating/ratingValue
  )
'
A Countess from Hong Kong
1967-03-15
PT2H
6

...or, to get output similar to the above:

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  parse-json(//script[@type="application/ld+json"])/(
    name,
    year-from-date(date(datePublished)),
    substring(lower-case(duration),3),
    format-number(aggregateRating/ratingValue,"#.0")||"/10"
  )
'
A Countess from Hong Kong
1967
2h
6.0/10
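
The two conversions in that query (a duration such as PT2H into 2h, and a bare rating such as 6 into 6.0/10) can be mimicked in plain shell, which may help when checking what the XPath expressions do. This is only a sketch: the function names are hypothetical and the inputs are the values shown in the outputs above.

```shell
#!/bin/bash

# Hypothetical helper: "PT1H50M" -> "1h50m".
# Same idea as substring(lower-case(duration),3): lower-case the value,
# then drop the leading two characters ("pt").
format_duration() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/^pt//'
}

# Hypothetical helper: "6" -> "6.0/10".
# Same idea as format-number(ratingValue,"#.0")||"/10"; LC_ALL=C is meant to
# keep "." as the decimal separator regardless of the user's locale.
format_rating() {
  LC_ALL=C printf '%.1f/10\n' "$1"
}

format_duration "PT2H"
format_duration "PT1H50M"
format_rating 6
format_rating 5.5
```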

All combined:

$ xidel -s "https://www.imdb.com/name/nm0000122" \
  -f '//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href' \
  -e '
    parse-json(//script[@type="application/ld+json"])/(
      name,
      year-from-date(date(datePublished)),
      substring(lower-case(duration),3),
      format-number(aggregateRating/ratingValue,"#.0")||"/10"
    )
  '
A Countess from Hong Kong
1967
2h
6.0/10
A King in New York
1957
1h50m
7.0/10
Limelight
1952
2h17m
8.0/10
[...]
Making a Living
1914
11m
5.5/10

bash

sed

web-scraping