R split string and keep section
I have a string containing the starting lineup for a rugby match (extracted from the web), and it looks like this:
"Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"
What I want is basically a table with two columns, one with the player's number and the other with the player's name. For example:
position name
1 Joe Moody
2 Codie Taylor
3 Owen Franks
4 Scott Barrett
... ...
for all players.
I've tried using strsplit and splitting by ",", but then the problem becomes the first player:
"Crusaders: 15 David Havili"
and numbers 1 and 16 being merged:
"1 Joe MoodyReplacements: 16 Sam Anderson-Heather".
Any ideas?
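I agree with @HongOoi; it would be best to take a step back and make sure the data is imported in a more sensible way. That said, here is a post-hoc hacky solution. I'm not sure how well it generalises, if at all.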
ss <- "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"
library(tidyverse)
data.frame(ss = ss) %>%
  mutate(ss = str_replace(ss, "Replacements", "")) %>%  # Remove "Replacements"
  mutate(ss = str_split(ss, "(,|:) ")) %>%              # Split on ", " or ": "
  unnest(ss) %>%                                        # One row per piece
  separate(ss, c("position", "name"), sep = "(?<=\\d)\\s", fill = "right") %>%  # Split at the space right after the number
  filter(!is.na(name))                                  # Remove the first "Crusaders" line
# position name
#1 15 David Havili
#2 14 Seta Tamanivalu
#3 13 Jack Goodhue
#4 12 Ryan Crotty
#5 11 George Bridge
#6 10 Richie Mo’unga
#7 9 Bryn Hall
#8 8 Kieran Read
#9 7 Matt Todd
#10 6 Heiden Bedwell-Curtis
#11 5 Sam Whitelock (c)
#12 4 Scott Barrett
#13 3 Owen Franks
#14 2 Codie Taylor
#15 1 Joe Moody
#16 16 Sam Anderson-Heather
#17 17 Tim Perry
#18 18 Michael Alaalatoa
#19 19 Luke Romano
#20 20 Pete Samu
#21 21 Mitchell Drummond
#22 22 Mitchell Hunt
#23 23 Braydon Ennor
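The separator passed to separate() is probably the least obvious part of the pipeline above. As a minimal base R sketch (the `pieces` vector is a made-up, shortened input, and strsplit() with perl = TRUE stands in for separate() here), this shows what the lookbehind pattern does: it splits only at a whitespace character that directly follows a digit, so spaces inside names are left alone and a piece without a number is not split at all.

```r
# Hypothetical, shortened pieces as they might look after splitting on ", " or ": "
pieces <- c("Crusaders", "15 David Havili", "5 Sam Whitelock (c)")

# Split at a whitespace character that is immediately preceded by a digit
parts <- strsplit(pieces, "(?<=\\d)\\s", perl = TRUE)

# Pieces without a leading number yield a single element, so the name becomes NA
position <- vapply(parts, function(p) p[1], character(1))
name     <- vapply(parts, function(p) p[2], character(1))

print(data.frame(position, name))
```

The piece with no number ends up with a missing name, which is the row that filter(!is.na(name)) drops in the pipeline.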
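Here is a quick and dirty approach that works for your example string. It will not work for other strings if the team name is missing at the start.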
player.string <- "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"
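# Inner gsub: turn the team label, "Replacements:" and ", " into line breaks.
# Outer gsub: put a tab after each number so read.table() sees two fields per line.
df <- read.table(text = gsub("(\\d+)", "\\1\t", gsub("Replacements:|(^[^:]*:)|, ", "\n", player.string)), header = FALSE, sep = "\t", col.names = c("Number", "Name"), strip.white = TRUE)
df[order(df$Number), ]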
Number Name
15 1 Joe Moody
14 2 Codie Taylor
13 3 Owen Franks
12 4 Scott Barrett
11 5 Sam Whitelock (c)
10 6 Heiden Bedwell-Curtis
9 7 Matt Todd
8 8 Kieran Read
7 9 Bryn Hall
...
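To make the nested gsub() calls easier to follow, here is a minimal sketch that runs the two substitutions one at a time on a shortened, made-up lineup. The names `lineup`, `step1` and `step2` are only for illustration and do not appear in the answer.

```r
# Shortened, hypothetical lineup in the same format as the example string
lineup <- "Crusaders: 15 David Havili, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather"

# Step 1: replace the team label, "Replacements:" and every ", " with a line break
step1 <- gsub("Replacements:|(^[^:]*:)|, ", "\n", lineup)

# Step 2: put a tab after each run of digits so number and name become two fields
step2 <- gsub("(\\d+)", "\\1\t", step1)

# One player per line, number and name separated by a tab
writeLines(step2)
```

The first line of the result is empty because the team label itself was replaced by a line break; read.table() skips blank lines by default, so this does not produce an extra row.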
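Using stringr::str_match_all() and some regex, you can find and extract all the matches, taking care to use the non-greedy (?) operator and to match the end of the string, where there is no trailing comma: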
library(dplyr)
library(stringr)
ea <- "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"
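ea <- unlist(strsplit(ea, "Replacements: "))  # Split the starters from the replacements
tibble(jersey = str_match_all(ea, "\\d+") %>% unlist(),                    # Every run of digits
       player = str_match_all(ea, "(?<=\\d\\s).*?(?=$|,)") %>% unlist())  # Text after "<number> " up to the next comma or the end of the string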
# A tibble: 23 x 2
jersey player
<chr> <chr>
1 15 David Havili
2 14 Seta Tamanivalu
3 13 Jack Goodhue
4 12 Ryan Crotty
5 11 George Bridge
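To see why the non-greedy quantifier matters here, the following minimal sketch compares a greedy and a non-greedy version of the name pattern on a shortened, made-up string. It uses base R's regmatches() and gregexpr() with perl = TRUE instead of stringr, purely so that it runs without extra packages; the variable names are illustrative.

```r
# Shortened, hypothetical input with two players
x <- "15 David Havili, 14 Seta Tamanivalu"

# Greedy: .* runs on to the end of the string, swallowing the comma and the next player
greedy <- regmatches(x, gregexpr("(?<=\\d\\s).*(?=$|,)", x, perl = TRUE))[[1]]

# Non-greedy: .*? stops at the first position followed by a comma or the end of the string
lazy <- regmatches(x, gregexpr("(?<=\\d\\s).*?(?=$|,)", x, perl = TRUE))[[1]]

print(greedy)
print(lazy)
```

The greedy pattern returns a single match covering both players, while the non-greedy one returns one match per player.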