Signalling where a new row should start based on arbitrary characters when converting webscraped output to a tibble

I am scraping a TV script and then trying to clean it up. This is what I have so far:

library(tidyverse)
library(rvest)

s1_e1 <- read_html('http://www.chakoteya.net/DoctorWho/27-1.htm')

s1_e1 <- s1_e1 %>%
  html_nodes("p") %>%
  html_text() 

# Backslashes must be doubled inside R string literals
s1_e1 <- str_replace_all(string = s1_e1, pattern = "\\s*\\([^\\)]+\\)", replacement = "")
s1_e1 <- str_replace_all(string = s1_e1, pattern = "\\s*\\[[^\\]]+\\]", replacement = "")
s1_e1 <- str_squish(s1_e1)

s1_e1 <- s1_e1 %>% 
  as_tibble() %>% 
  filter(value!="") %>% 
  mutate(season = "27",
         episode_num = "1",
         airdate_orig = str_sub(.$value[1], -12),
         episode_name = str_sub(.$value[1], 1, regexpr(" O", .$value[1])-1)) %>% 
  slice(-1)

This gives me the following:

# A tibble: 38 x 5
   value                                    season episode_num airdate_orig episode_name
   <chr>                                    <chr>  <chr>       <chr>        <chr>       
 1 ROSE: Bye! JACKIE: See you later!        27     1           26 Mar, 2005 Rose        
 2 TANNOY: This is a customer announcement… 27     1           26 Mar, 2005 Rose        
 3 ROSE: You pulled his arm off. DOCTOR: Y… 27     1           26 Mar, 2005 Rose        
 4 ROSE: That's just not funny. That's sic… 27     1           26 Mar, 2005 Rose        
 5 TAXI DRIVER: Watch it!                   27     1           26 Mar, 2005 Rose        
 6 TELEVISION: The whole of Central London… 27     1           26 Mar, 2005 Rose        
 7 JACKIE: There's no point in getting up,… 27     1           26 Mar, 2005 Rose        
 8 JACKIE: There's Finch's. You could try … 27     1           26 Mar, 2005 Rose        
 9 ROSE: It's about last night. He's part … 27     1           26 Mar, 2005 Rose        
10 ROSE: Don't mind the mess. Do you want … 27     1           26 Mar, 2005 Rose        
# … with 28 more rows

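I would like each row to be one character's speech. As you can see, the script thankfully puts the speaker's name in capitals, followed by a colon and a space, before each new speech, e.g. ROSE: or TANNOY: . Is there a way to tell R that I want each row of the tibble to start with this capitalised text followed by a colon, and to continue until another capitalised word followed by a colon appears?

For example, the first row would start with ROSE: Bye!, the second with JACKIE: See you later!, and the third with TANNOY: This is a customer announcement…, each running until the next capitalised word followed by a colon, and so on.

Also, if anyone has suggestions on how I can integrate the stringr functions into the dplyr chunk, please let me know. I can make a separate post about that if that's better, but I keep getting errors when I try (although the code above is functional).

Many thanks!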

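You can split on a lookahead pattern, i.e. on whitespace that is immediately followed by capital letters and a colon: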

library(tidyverse)

s1_e1 %>% 
  mutate(value=str_split(value, "\\s(?=[A-Z]+:)")) %>% 
  unnest(value)

This returns:

# A tibble: 322 x 5
   value                                                             season episode_num airdate_orig episode_name
   <chr>                                                             <chr>  <chr>       <chr>        <chr>       
 1 ROSE: Bye!                                                        27     1           26 Mar, 2005 Rose        
 2 JACKIE: See you later!                                            27     1           26 Mar, 2005 Rose        
 3 TANNOY: This is a customer announcement. The store will be closi~ 27     1           26 Mar, 2005 Rose        
 4 GUARD: Oi!                                                        27     1           26 Mar, 2005 Rose        
 5 ROSE: Wilson? Wilson, I've got the lottery money. Wilson, are yo~ 27     1           26 Mar, 2005 Rose        
 6 ROSE: I can't hang about 'cos they're closing the shop. Wilson! ~ 27     1           26 Mar, 2005 Rose        
 7 ROSE: Hello? Hello, Wilson, it's Rose. Hello? Wilson?             27     1           26 Mar, 2005 Rose        
 8 ROSE: Wilson? Wilson!                                             27     1           26 Mar, 2005 Rose        
 9 ROSE: You're kidding me.                                          27     1           26 Mar, 2005 Rose        
10 ROSE: Is that someone mucking about? Who is it?                   27     1           26 Mar, 2005 Rose    
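
To see what the lookahead does in isolation, here is a small self-contained sketch on made-up lines (the `toy` tibble is hypothetical, not the scraped data). It also includes a variant pattern, which is an assumption of mine rather than part of the answer above, that avoids splitting inside speaker names containing a space, such as TAXI DRIVER:.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical toy data standing in for the scraped paragraphs
toy <- tibble(value = c("ROSE: Bye! JACKIE: See you later!",
                        "TAXI DRIVER: Watch it!"))

# Split at whitespace that is immediately followed by CAPITALS plus a colon.
# The lookahead (?=...) is not consumed, so the speaker name stays in the next piece.
split_simple <- toy %>%
  mutate(value = str_split(value, "\\s(?=[A-Z]+:)")) %>%
  unnest(value)

# Assumed variant: additionally require that the whitespace is NOT preceded by a
# capital letter, and allow several capitalised words before the colon.
split_multi <- toy %>%
  mutate(value = str_split(value, "(?<![A-Z])\\s(?=[A-Z]+(?: [A-Z]+)*:)")) %>%
  unnest(value)

print(split_simple)
print(split_multi)
```

On this toy input the first pattern breaks TAXI DRIVER: into two rows, while the variant keeps it together. The variant has its own limitation: it will not split when a speech happens to end in a capital letter with no punctuation before the next speaker.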

Simplified workflow

You can indeed put all of the operations into a single pipe:

s1_e1 <- read_html('http://www.chakoteya.net/DoctorWho/27-1.htm') %>%
  html_nodes("p") %>%
  html_text() %>% 
  tibble(value = .) %>% 
  mutate(value = str_squish(str_replace_all(value, "(\\s*\\([^\\)]+\\)|\\s*\\[[^\\]]+\\])", ""))) %>% 
  filter(value!="") %>% 
  mutate(season = "27",
         episode_num = "1",
         airdate_orig = str_sub(.$value[1], -12),
         episode_name = str_sub(.$value[1], 1, regexpr(" O", .$value[1])-1)) %>% 
  slice(-1) %>% 
  mutate(value=str_split(value, "\\s(?=[A-Z]+:)")) %>% 
  unnest(value)
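
The reason stringr fits into a dplyr chain is that its functions are vectorised over a character vector, so inside mutate() they can be applied to the column by name rather than to the whole object. A minimal sketch on made-up strings (the `raw` vector is hypothetical and only stands in for the scraped paragraphs):

```r
library(dplyr)
library(stringr)

# Hypothetical raw paragraphs with a scene heading, a stage direction and stray whitespace
raw <- c("[Shop] ROSE: Bye! (Rose leaves.)", "  ", "JACKIE:  See you later!")

cleaned <- tibble(value = raw) %>%
  # remove "(...)" and "[...]" together with any whitespace in front of them
  mutate(value = str_replace_all(value, "\\s*\\([^\\)]+\\)|\\s*\\[[^\\]]+\\]", "")) %>%
  # trim both ends and collapse repeated internal whitespace
  mutate(value = str_squish(value)) %>%
  # drop rows that are empty after cleaning
  filter(value != "")

print(cleaned)
```

Errors in this kind of chunk often come from passing the tibble itself to a stringr function instead of a column, so starting from tibble(value = ...) and working inside mutate() avoids that.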