Signalling where a new row should start based on arbitrary characters when converting web-scraped output to a tibble
I'm scraping a TV script and then trying to clean it up. Here's what I have so far:
library(tidyverse)
library(rvest)
s1_e1 <- read_html('http://www.chakoteya.net/DoctorWho/27-1.htm')
s1_e1 <- s1_e1 %>%
  html_nodes("p") %>%
  html_text()
# Backslashes must be doubled inside R string literals
s1_e1 <- str_replace_all(string = s1_e1, pattern = "\\s*\\([^\\)]+\\)", replacement = "")
s1_e1 <- str_replace_all(string = s1_e1, pattern = "\\s*\\[[^\\]]+\\]", replacement = "")
s1_e1 <- str_squish(s1_e1)
s1_e1 <- s1_e1 %>%
  as_tibble() %>%
  filter(value != "") %>%
  mutate(season = "27",
         episode_num = "1",
         airdate_orig = str_sub(.$value[1], -12),
         episode_name = str_sub(.$value[1], 1, regexpr(" O", .$value[1]) - 1)) %>%
  slice(-1)
This gives me the following:
# A tibble: 38 x 5
value season episode_num airdate_orig episode_name
<chr> <chr> <chr> <chr> <chr>
1 ROSE: Bye! JACKIE: See you later! 27 1 26 Mar, 2005 Rose
2 TANNOY: This is a customer announcement… 27 1 26 Mar, 2005 Rose
3 ROSE: You pulled his arm off. DOCTOR: Y… 27 1 26 Mar, 2005 Rose
4 ROSE: That's just not funny. That's sic… 27 1 26 Mar, 2005 Rose
5 TAXI DRIVER: Watch it! 27 1 26 Mar, 2005 Rose
6 TELEVISION: The whole of Central London… 27 1 26 Mar, 2005 Rose
7 JACKIE: There's no point in getting up,… 27 1 26 Mar, 2005 Rose
8 JACKIE: There's Finch's. You could try … 27 1 26 Mar, 2005 Rose
9 ROSE: It's about last night. He's part … 27 1 26 Mar, 2005 Rose
10 ROSE: Don't mind the mess. Do you want … 27 1 26 Mar, 2005 Rose
# … with 28 more rows
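I want each row to be one character's line of dialogue. As you can see, the script thankfully puts the speaker in capitals, followed by a colon and a space, before each new line of dialogue, e.g. ROSE: or TANNOY:. Is there a way to tell R that I want each row of the tibble to start with this capitalised text followed by a colon, and to continue until another capitalised word followed by a colon appears?
For example, the first row would start with ROSE: Bye!, the second with JACKIE: See you later!, and the third with TANNOY: This is a customer announcement…, continuing until it reaches another capitalised word followed by a colon, and so on.
Also, if anyone has suggestions on how I could integrate the stringr functions into the dplyr chunk, please let me know. I can make a separate post for that if that's best, but I keep getting errors when I try (though the code above works).
Thanks so much!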
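You can use a lookahead pattern. The split happens at whitespace that is immediately followed by a run of capital letters and a colon; because the lookahead does not consume anything, the speaker label stays at the start of the next piece: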
library(tidyverse)
s1_e1 %>%
  mutate(value = str_split(value, "\\s(?=[A-Z]+:)")) %>%
  unnest(value)
This returns:
# A tibble: 322 x 5
value season episode_num airdate_orig episode_name
<chr> <chr> <chr> <chr> <chr>
1 ROSE: Bye! 27 1 26 Mar, 2005 Rose
2 JACKIE: See you later! 27 1 26 Mar, 2005 Rose
3 TANNOY: This is a customer announcement. The store will be closi~ 27 1 26 Mar, 2005 Rose
4 GUARD: Oi! 27 1 26 Mar, 2005 Rose
5 ROSE: Wilson? Wilson, I've got the lottery money. Wilson, are yo~ 27 1 26 Mar, 2005 Rose
6 ROSE: I can't hang about 'cos they're closing the shop. Wilson! ~ 27 1 26 Mar, 2005 Rose
7 ROSE: Hello? Hello, Wilson, it's Rose. Hello? Wilson? 27 1 26 Mar, 2005 Rose
8 ROSE: Wilson? Wilson! 27 1 26 Mar, 2005 Rose
9 ROSE: You're kidding me. 27 1 26 Mar, 2005 Rose
10 ROSE: Is that someone mucking about? Who is it? 27 1 26 Mar, 2005 Rose
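One caveat worth checking: the pattern `[A-Z]+:` only looks at the last capitalised word before the colon, so a multi-word label such as TAXI DRIVER: (which appears in the question's data) would be split into "TAXI" and "DRIVER: Watch it!". The sketch below is one possible adjustment, not part of the original answer. It uses a small made-up tibble (`lines`) instead of the scraped page, and the name `speaker_split` is purely illustrative. It assumes a speaker label is one or more all-caps words separated by single spaces, and that the text just before a new label does not end in a capital letter.

```r
library(dplyr)
library(stringr)
library(tibble)
library(tidyr)

# Toy stand-in for the scraped paragraphs (illustrative, not the real script).
lines <- tibble(value = c(
  "ROSE: Bye! JACKIE: See you later!",
  "TAXI DRIVER: Watch it!"
))

# Split at whitespace that is followed by an all-caps label and a colon.
# The label may span several words ("TAXI DRIVER").
# The negative lookbehind (?<![A-Z]) refuses to split right after a capital
# letter, which keeps multi-word labels in one piece.
speaker_split <- "(?<![A-Z])\\s(?=[A-Z]+(?: [A-Z]+)*:)"

result <- lines %>%
  mutate(value = str_split(value, speaker_split)) %>%
  unnest(value)

print(result$value)
```

With this toy input the three pieces should be "ROSE: Bye!", "JACKIE: See you later!" and "TAXI DRIVER: Watch it!". Labels containing digits, apostrophes or lower-case letters would need a wider character class, so it is worth inspecting the distinct labels in the real data before settling on a pattern.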
Simplified workflow
You can indeed put all the operations into a single pipeline:
s1_e1 <- read_html('http://www.chakoteya.net/DoctorWho/27-1.htm') %>%
  html_nodes("p") %>%
  html_text() %>%
  tibble(value = .) %>%
  # Drop (...) and [...] annotations, then collapse repeated whitespace
  mutate(value = str_squish(str_replace_all(value, "(\\s*\\([^\\)]+\\)|\\s*\\[[^\\]]+\\])", ""))) %>%
  filter(value != "") %>%
  mutate(season = "27",
         episode_num = "1",
         airdate_orig = str_sub(.$value[1], -12),
         episode_name = str_sub(.$value[1], 1, regexpr(" O", .$value[1]) - 1)) %>%
  slice(-1) %>%
  # One row per speaker: split before each capitalised label
  mutate(value = str_split(value, "\\s(?=[A-Z]+:)")) %>%
  unnest(value)
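On the side question about using stringr inside a dplyr chain: stringr functions are vectorised over a character column, so they can be called directly inside `mutate()` once the text lives in a tibble column. The following is a minimal offline sketch of just the cleaning step, using invented sample strings (`raw`) rather than the scraped page, so it can be run without network access.

```r
library(dplyr)
library(stringr)
library(tibble)

# Invented sample paragraphs (hypothetical, for illustration only).
raw <- tibble(value = c(
  "ROSE: Bye! (leaves) ",
  "[Shop floor]",
  "  JACKIE:   See you later! "
))

cleaned <- raw %>%
  mutate(
    # Remove parenthesised annotations such as "(leaves)"
    value = str_replace_all(value, "\\s*\\([^\\)]+\\)", ""),
    # Remove bracketed annotations such as "[Shop floor]"
    value = str_replace_all(value, "\\s*\\[[^\\]]+\\]", ""),
    # Trim the ends and collapse internal runs of whitespace
    value = str_squish(value)
  ) %>%
  # Rows that contained only an annotation are now empty
  filter(value != "")

print(cleaned)
```

Splitting the cleaning into separate assignments inside one `mutate()` keeps each regular expression readable; each later assignment sees the result of the previous one.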