Jazzy Scraping with R and without Selectors
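I have been using rvest to scrape pages, and I know the benefits of selectorGadget. However, one page contains data that has no selector. A snippet of the HTML is below. The page is here. I am trying to scrape the personnel listing for each jazz album on the page. In the HTML snippet below, the personnel data begins with "Sonny Rollins, tenor sax..." and, as you can see, that text is not enclosed by any element that a CSS selector could target. Any suggestions on how to approach this?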
<h1>Blue Note Records Catalog: 4000 series</h1>
<div id="start-here"><!-- id="start-here" --></div>
<div id="catalog-data">
<h2>Modern Jazz 4000 series (12 inch LP)</h2>
<h3><a href="./album-index/#blp-4001" name="blp-4001">BLP 4001 Sonny
Rollins - Newk's Time <i>1959</i></a></h3>
Sonny Rollins, tenor sax; Wynton Kelly, piano #1,2,4-6; Doug Watkins, bass
#1,2,4-6; Philly Joe Jones, drums.
<div class="date">Van Gelder Studio, Hackensack, NJ, September 22,
1957</div>
<table width="100%">
<tr><td width="15%">1. tk.5<td>Tune Up
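etc...
Extract the text nodes with XPath, then filter out the unwanted elements with a regular expression. The following script should work.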
library(rvest)
library(stringr)
# Select the text nodes that are direct children of #catalog-data
texts <- read_html("https://www.jazzdisco.org/blue-note-records/catalog-4000-series/") %>%
  html_nodes(xpath = '//*[@id="catalog-data"]/text()') %>%
  html_text()
# Drop nodes that are only a newline or that start with "\n**"
texts[!str_detect(texts, "(^\n$)|(^\n\\*\\*)")] # I just noticed this line doesn't clean up the strings entirely; you can work out a better regex.
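To see what the `text()` step picks up without hitting the live site, here is a minimal sketch that runs the same XPath against an inline copy of the snippet from the question. The `html` string and the `personnel` variable are only for illustration, and the whitespace clean-up with `str_squish` is one possible alternative to the regex filter, not necessarily what the full page needs.

```r
library(rvest)
library(stringr)

# Inline copy of the snippet from the question (closed off so it parses cleanly).
html <- '<div id="catalog-data">
<h2>Modern Jazz 4000 series (12 inch LP)</h2>
<h3><a href="./album-index/#blp-4001" name="blp-4001">BLP 4001 Sonny
Rollins - Newk\'s Time <i>1959</i></a></h3>
Sonny Rollins, tenor sax; Wynton Kelly, piano #1,2,4-6; Doug Watkins, bass
#1,2,4-6; Philly Joe Jones, drums.
<div class="date">Van Gelder Studio, Hackensack, NJ, September 22,
1957</div>
</div>'

# text() returns only text that sits directly inside #catalog-data,
# not text nested in child elements such as <h3> or <div class="date">.
texts <- read_html(html) %>%
  html_nodes(xpath = '//*[@id="catalog-data"]/text()') %>%
  html_text()

# Collapse line breaks and repeated spaces, then drop entries left empty.
personnel <- str_squish(texts)
personnel <- personnel[personnel != ""]
personnel
```

On this snippet the only non-empty text node is the personnel line, since the album title and the recording date live inside their own elements.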
As for splitting the string, you can try the following code:
sample_str <- "\nIke Quebec, tenor sax; Sonny Clark, piano; Grant Green, guitar; Sam Jones, bass; Louis Hayes, drums.\n"
str_trim(sample_str) %>%
  str_split(",")
returns:
[[1]]
[1] "Ike Quebec" " tenor sax; Sonny Clark" " piano; Grant Green" " guitar; Sam Jones" " bass; Louis Hayes" " drums."
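Splitting on "," mixes each instrument with the next musician's name, because the musicians themselves are separated by ";". A possible refinement, offered as a sketch rather than the definitive approach, is to split on ";" first and then split each entry at its first comma only. The column names `name` and `instrument` are my own choice.

```r
library(stringr)

sample_str <- "\nIke Quebec, tenor sax; Sonny Clark, piano; Grant Green, guitar; Sam Jones, bass; Louis Hayes, drums.\n"

# Trim surrounding whitespace and remove the trailing period.
cleaned <- str_remove(str_trim(sample_str), "\\.$")

# One entry per musician: split on semicolons plus any following whitespace.
entries <- str_split(cleaned, ";\\s*")[[1]]

# Split each entry into at most two pieces at the first comma,
# so later commas (e.g. track lists like "#1,2,4-6") stay in the second piece.
parts <- str_split_fixed(entries, ",\\s*", n = 2)

personnel <- data.frame(name = parts[, 1], instrument = parts[, 2])
personnel
```

Using `n = 2` matters for entries such as "Wynton Kelly, piano #1,2,4-6" from the question's snippet, where the track numbers contain commas of their own.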