Web scraping returns merged sentences in R
I scraped some lyrics from https://www.vagalume.com.br/ivete-sangalo/. On the site, the lyrics are displayed like this (just a snippet):
Quando a chuva passar
Pra quê falar
Se você não quer me ouvir?
Fugir agora não resolve nada
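As you can see, each sentence is on its own line. But when I scrape the lyrics and save them to a CSV file, R returns the sentences merged together, like this: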
Output:
Quando a chuva passarPra quê falarSe você não quer me ouvir?Fugir agora não resolve nada
Here is my code:
library(rvest)
library(dplyr)
link <- "https://www.vagalume.com.br/ivete-sangalo/"
page <- read_html(link)
name_link_id <- page %>% html_nodes('.nameMusic') %>% html_attr("href")
name_link_full <- page %>% html_nodes('.nameMusic') %>% html_attr("href") %>%
  paste("https://www.vagalume.com.br", ., sep = "")
get_lyrics <- function(lyrics_link){
  lyric <- read_html(lyrics_link)
  all_lyrics <- lyric %>% html_nodes('#lyrics') %>% html_text()
  return(all_lyrics)
}
lyr <- sapply(name_link_full, FUN = get_lyrics)
lyrs <- data.frame(lyr, stringsAsFactors = FALSE)
write.csv(lyrs, 'Ivete.Sangalo.csv')
I have tried stringi and strsplit(), but nothing changed. How can I solve this problem?
The following function reads the data and returns a data.frame with a single column named lyrics.
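library(rvest)
library(dplyr)
library(stringr)  # provides str_split()
# html_text2() requires rvest >= 1.0.0; unlike html_text(), it turns <br> into "\n"
get_lyrics <- function(lyrics_link){
  lyrics_link %>%
    read_html() %>%
    html_nodes('#lyrics') %>%
    html_text2() %>%
    gsub("\n\n", "\n", .) %>%        # collapse blank lines between stanzas
    str_split(pattern = "\n") %>%    # one element per lyric line
    unlist() %>%
    as.data.frame() %>%
    `names<-`("lyrics")
}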
link <- "https://www.vagalume.com.br/ivete-sangalo/"
page <- read_html(link)
name_link_full <- page %>%
  html_nodes('.nameMusic') %>%
  html_attr("href") %>%
  paste("https://www.vagalume.com.br", ., sep = "")
lyr <- lapply(name_link_full[1:5], FUN = get_lyrics)
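To see why the question's code merged the lines, it may help to run both text extractors on a small inline fragment. This is only a sketch: the HTML below is a made-up stand-in for the '#lyrics' element (the real page markup may differ), and the variable names are illustrative. html_text() returns the raw text content, so a <br> tag contributes no separator, while html_text2() (rvest 1.0.0 or later) renders <br> as a newline.

```r
library(rvest)

# Hypothetical fragment standing in for the '#lyrics' element;
# placeholder text, not taken from the real page.
snippet <- minimal_html(
  '<div id="lyrics">first line<br>second line<br>third line</div>'
)
node <- html_element(snippet, "#lyrics")

# Raw text content: <br> adds nothing, so the lines run together.
text_merged <- html_text(node)

# Browser-like rendering: each <br> becomes "\n".
text_split <- html_text2(node)

# One element per lyric line.
lyric_lines <- strsplit(text_split, "\n", fixed = TRUE)[[1]]

print(text_merged)   # "first linesecond linethird line"
print(lyric_lines)   # "first line" "second line" "third line"
```

This also suggests why strsplit() alone did not help in the question: by the time html_text() returns, there is no newline left in the string to split on.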
Edit
Following the comments below, here are two ways to write the lyrics to a file.
First, rbind the vectors of the list lyr and remove the column header.
lyr <- lapply(name_link_full[1:5], FUN = get_lyrics)
lyrs <- lapply(lyr, \(l) paste(unlist(l), collapse = " "))
lyrs <- do.call(rbind.data.frame, lyrs)
names(lyrs) <- ''
Then write to csv and txt. The directory "~/tmp" is optional.
old_dir <- getwd()
setwd("~/tmp")
write.csv(lyrs, 'Ivete.Sangalo.csv', quote = FALSE, row.names = FALSE)
writeLines(unlist(lyrs), con = 'Ivete.Sangalo.txt')
setwd(old_dir)
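If the goal is to keep each lyric line separate in the file, one alternative (not part of the answer above, just a sketch) is a long-format table with one row per line plus a song index. The data below is made up and stands in for the list lyr returned by get_lyrics(); the column names song and line and the temporary file path are assumptions chosen for illustration.

```r
# Toy stand-in for `lyr`: a list of data.frames, each with a `lyrics` column.
lyr_demo <- list(
  data.frame(lyrics = c("first line", "second line"),
             stringsAsFactors = FALSE),
  data.frame(lyrics = c("another first line", "another second line", "third line"),
             stringsAsFactors = FALSE)
)

# Stack the songs, tagging each row with the index of the song it came from.
long <- do.call(rbind, lapply(seq_along(lyr_demo), function(i) {
  data.frame(song = i, line = lyr_demo[[i]]$lyrics, stringsAsFactors = FALSE)
}))

# Keep the default quoting so commas inside a lyric line do not break the CSV.
out_file <- tempfile(fileext = ".csv")
write.csv(long, out_file, row.names = FALSE)

# Reading the file back gives one row per lyric line.
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(check)
```

Compared with pasting every song into one string, this layout keeps the line structure recoverable, at the cost of a taller file.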