Break Apart a String into Separate Columns in R
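I am trying to tidy some data that is all contained, as a string, in a single column called "game_info". The data describes upcoming college basketball games and includes the date, time, team IDs, team names, and so on. Ideally each of these would be its own column. I tried separating on a space delimiter, but that did not work well: some teams (such as "Duke") have a one-part name, while others have names with 2 to 3 parts (Michigan State, South Dakota State, etc.). There are also teams with "-" dashes in their names.
Here is my data: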
df <- data.frame(list(
game_info = c(
"12/16 7:00 PM 751 Appalachian State 752 Duke",
"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
"12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
"12/16 10:00 PM 757 Dartmouth 758 Stanford"
)
))
Desired output:
date   time      away_team_id  away_team_name     home_team_id  home_team_name
12/16  7:00 PM   751           Appalachian State  752           Duke
12/16  7:00 PM   753           Chicago State      754           Indiana-Purdue
12/16  8:00 PM   755           Texas-Arlington    756           Oral Roberts
12/16  10:00 PM  757           Dartmouth          758           Stanford
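A quick check illustrates why a plain space split is not enough: the rows do not all yield the same number of pieces, and a team name can be spread over several pieces. This is a minimal sketch on the sample data; `pieces` and `n_pieces` are names introduced only for illustration.

```r
df <- data.frame(
  game_info = c(
    "12/16 7:00 PM 751 Appalachian State 752 Duke",
    "12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
    "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
    "12/16 10:00 PM 757 Dartmouth 758 Stanford"
  ),
  stringsAsFactors = FALSE
)

# Split every row on single spaces and count the resulting pieces
pieces <- strsplit(df$game_info, " ", fixed = TRUE)
n_pieces <- lengths(pieces)
print(n_pieces)        # the last row has fewer pieces than the others

# In the first row the away team name occupies two pieces
print(pieces[[1]][5:6])
```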
Here is one with a regular expression. See the regex101 link for an explanation of the regex.
# Backslashes must be doubled inside an R string literal
regex <- "^(\\d{2}/\\d{2})\\s*(\\d{1,2}:\\d{2}\\s*(PM|AM))\\s*(\\d+)\\s*([^\\d.]+)(\\d+)\\s*([^\\d.]+)$"
data <- data.frame(game_info = c(
  "12/16 7:00 PM 751 Appalachian State 752 Duke",
  "12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
  "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
  "12/16 10:00 AM 757 Dartmouth 758 Stanford"
))
library(stringr)
# str_match_all() returns one match matrix per string; stack them row-wise
out <- do.call(rbind, str_match_all(data$game_info, regex))
out <- as.data.frame(out)
# remove full string & AM/PM
out$V1 <- NULL
out$V4 <- NULL
names(out) <- c("date", "time", "away_team_id", "away_team_name",
"home_team_id", "home_team_name")
# remove white space from end
out$away_team_name <- trimws(out$away_team_name)
out$home_team_name <- trimws(out$home_team_name)
out
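Explanation:
^(\d{2}/\d{2}) - starts with 2 digits, a slash, and 2 digits, e.g. 12/16. ^ is the start anchor, and () marks a group that we want to capture for extraction.
\s* - 0 or more whitespace characters between our first group and the next one.
(\d{1,2}:\d{2}\s*(PM|AM)) - 1 or 2 digits, a colon, 2 digits, then optional whitespace and PM or AM.
\s*(\d+)\s* - the first ID: one or more digits, with optional whitespace on either side.
([^\d.]+) - everything that is not a digit (the class also excludes the period). This will fail if your team names contain digits. If they do, find some examples and we can improve it. The trailing white space is captured as well, so it is removed later with trimws.
(\d+)\s* - the second ID followed by optional whitespace.
([^\d.]+)$ - finally the other team's name and the end-of-string anchor.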
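The same group-by-group idea can also be sketched in base R with `utils::strcapture()`, which takes a prototype data frame and converts each captured group to the matching column type. This is an illustrative variant under stated assumptions, not the answer's code: the pattern below is a slightly simplified version (it uses `\D` for the team names and drops the inner AM/PM group), and the names `pattern`, `proto`, and `games` are introduced here.

```r
df <- data.frame(
  game_info = c(
    "12/16 7:00 PM 751 Appalachian State 752 Duke",
    "12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
    "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
    "12/16 10:00 PM 757 Dartmouth 758 Stanford"
  ),
  stringsAsFactors = FALSE
)

# One capture group per output column; \\D+? is lazy so the surrounding
# whitespace is left outside the captured team name
pattern <- paste0(
  "^(\\d{1,2}/\\d{1,2})\\s+",       # date
  "(\\d{1,2}:\\d{2}\\s*[AP]M)\\s+", # time
  "(\\d+)\\s+(\\D+?)\\s+",          # away team id and name
  "(\\d+)\\s+(\\D+)$"               # home team id and name
)

# The prototype fixes both the column names and the column types
proto <- data.frame(
  date = character(), time = character(),
  away_team_id = integer(), away_team_name = character(),
  home_team_id = integer(), home_team_name = character(),
  stringsAsFactors = FALSE
)

games <- strcapture(pattern, df$game_info, proto, perl = TRUE)
print(games)
```

Like the original pattern, this sketch assumes team names contain no digits. Rows that do not match come back as NA rather than raising an error.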
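A simple approach is to use extract from the tidyr package (alongside dplyr for the pipe) with a regular expression:
library(dplyr)  # for %>%
library(tidyr)  # for extract()
# Define the column names: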
column_names <- c("date", "time", "away_team_id", "away_team_name", "home_team_id", "home_team_name")
# Define the regex expression:
regex_expr <- paste(
"([0-9]{1,2}[/][0-9]{1,2})", # The date
"([0-9]{1,2}:[0-9]{1,2} [A-Za-z]{2})", # The time
"([0-9]+)", # The away team id
"([A-Za-z -]+)", # The away team name
"([0-9]+)", # The home team id
"([A-Za-z -]+)" # The home team name
)
# Extract the columns:
df %>% extract(col = game_info, into = column_names, regex = regex_expr)
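The call above returns every new column as text. As a small sketch of one possible refinement, `tidyr::extract()` has a `convert` argument that runs type conversion on the new columns, so the IDs come back as integers. The result name `games` is introduced here for illustration.

```r
library(tidyr)

df <- data.frame(
  game_info = c(
    "12/16 7:00 PM 751 Appalachian State 752 Duke",
    "12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
    "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
    "12/16 10:00 PM 757 Dartmouth 758 Stanford"
  ),
  stringsAsFactors = FALSE
)

column_names <- c("date", "time", "away_team_id", "away_team_name",
                  "home_team_id", "home_team_name")

# paste() joins the groups with a single space, which is the separator
# between fields in the input
regex_expr <- paste(
  "([0-9]{1,2}[/][0-9]{1,2})",
  "([0-9]{1,2}:[0-9]{1,2} [A-Za-z]{2})",
  "([0-9]+)",
  "([A-Za-z -]+)",
  "([0-9]+)",
  "([A-Za-z -]+)"
)

# convert = TRUE turns the all-digit columns into integers
games <- tidyr::extract(df, col = game_info, into = column_names,
                        regex = regex_expr, convert = TRUE)
str(games)
```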
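You can try this solution, which needs only simple pattern matching with [:digit:]. One additional requirement is that the date and time come first and that the team names (character data) sit between the numeric IDs.
In addition, you can use trimws on the split list dspl to remove unwanted tabs or similar whitespace.
Data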
dat <- structure(list(game_info = c("12/16 7:00 PM 751 Appalachian State 752 Duke",
"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue", "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
"12/16 10:00 PM 757 Dartmouth 758 Stanford")), class = "data.frame", row.names = c(NA,
-4L))
dspl <- strsplit( dat$game_info, " +" )
dat_tmp <- cbind( date=as.vector(sapply( dspl, function(x) x[1] )),
time=unlist( lapply( dspl, function(x) paste( x[2:3], collapse=" " ) ) ),
away_team_id=as.vector( sapply( dspl, function(x) x[4] ) ) )
data.frame( dat_tmp,
away_team_name=sapply( dspl, function(x)
paste(x[ tail( head( grep( "[[:digit:]]", x )[3]:grep( "[[:digit:]]", x )[4], -1 ), -1 ) ], collapse=" ") ),
home_team_id=sapply( dspl, function(x)
x[ max( grep( "[[:digit:]]", x ) )] ),
home_team_name=sapply( dspl, function(x)
paste( tail( x[ max( grep( "[[:digit:]]", x ) ):length(x)], -1), collapse=" " ) ) )
   date     time away_team_id    away_team_name home_team_id home_team_name
1 12/16  7:00 PM          751 Appalachian State          752           Duke
2 12/16  7:00 PM          753     Chicago State          754 Indiana-Purdue
3 12/16  8:00 PM          755   Texas-Arlington          756   Oral Roberts
4 12/16 10:00 PM          757         Dartmouth          758       Stanford
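The index arithmetic above can be hard to follow, so here is a hedged restatement of the same idea as a small helper: split on whitespace, locate the two tokens that consist only of digits, and paste together whatever lies between and after them. `split_game` and `games` are hypothetical names introduced for this sketch, which assumes every row has exactly two all-digit tokens.

```r
dat <- data.frame(
  game_info = c(
    "12/16 7:00 PM 751 Appalachian State 752 Duke",
    "12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
    "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
    "12/16 10:00 PM 757 Dartmouth 758 Stanford"
  ),
  stringsAsFactors = FALSE
)

split_game <- function(s) {
  tok <- strsplit(trimws(s), "[[:space:]]+")[[1]]
  # Tokens made only of digits are the IDs ("12/16" and "7:00" do not qualify)
  ids <- which(grepl("^[[:digit:]]+$", tok))
  stopifnot(length(ids) == 2)
  data.frame(
    date           = tok[1],
    time           = paste(tok[2:(ids[1] - 1)], collapse = " "),
    away_team_id   = tok[ids[1]],
    away_team_name = paste(tok[(ids[1] + 1):(ids[2] - 1)], collapse = " "),
    home_team_id   = tok[ids[2]],
    home_team_name = paste(tok[(ids[2] + 1):length(tok)], collapse = " "),
    stringsAsFactors = FALSE
  )
}

# Apply the helper to each row and stack the one-row data frames
games <- do.call(rbind, lapply(dat$game_info, split_game))
print(games)
```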
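Here is another approach. Note that, as its output shows, it keeps only the first word of the away team name and the last word of the home team name, so multi-word names are truncated: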
library(dplyr)
library(stringr)
library(tidyr)
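# Backslashes must be doubled inside an R string literal
my_pattern <- "\\b((1[0-2]|0?[1-9]):([0-5][0-9]) ([AaPp][Mm]))"
df %>%
  mutate(date = substr(game_info, 1, 5),
         time = str_extract(game_info, my_pattern),
         helper = str_remove(game_info, my_pattern), .keep = "unused") %>%
  # drop the leading date and collapse repeated whitespace
  mutate(helper = str_squish(str_remove(helper, substr(helper, 1, 5)))) %>%
  # only the first two space-separated pieces are kept; the rest are dropped
  separate(helper, c("away_team_id", "away_team_name"), sep = "\\s",
           extra = "drop", remove = FALSE) %>%
  # last run of digits in the string, and the last word of the string
  mutate(home_team_id = str_extract_all(helper, "(\\d+)(?!.*\\d)"),
         home_team_name = sub(".*\\s", "", helper), .keep = "unused")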
   date     time away_team_id  away_team_name home_team_id home_team_name
1 12/16  7:00 PM          751     Appalachian          752           Duke
2 12/16  7:00 PM          753         Chicago          754 Indiana-Purdue
3 12/16  8:00 PM          755 Texas-Arlington          756        Roberts
4 12/16 10:00 PM          757       Dartmouth          758       Stanford
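The output above shows "Appalachian" and "Roberts" where the full names were wanted. One possible way to keep multi-word names, sketched here with stringr only, is to strip the date and time first and then pick out the digit runs and the non-digit runs around them. This is a variant written for illustration, not the answer's method. `rest` and `full_names` are names introduced here, and the sketch assumes team names contain no digits.

```r
library(stringr)

df <- data.frame(
  game_info = c(
    "12/16 7:00 PM 751 Appalachian State 752 Duke",
    "12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
    "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
    "12/16 10:00 PM 757 Dartmouth 758 Stanford"
  ),
  stringsAsFactors = FALSE
)

# Remove the first three tokens (date, clock time, AM/PM),
# leaving "id name id name"
rest <- str_remove(df$game_info, "^\\S+\\s+\\S+\\s+\\S+\\s+")

full_names <- data.frame(
  date           = str_extract(df$game_info, "^\\d{1,2}/\\d{1,2}"),
  time           = str_extract(df$game_info, "\\d{1,2}:\\d{2} [AaPp][Mm]"),
  away_team_id   = str_extract(rest, "^\\d+"),
  # non-digits enclosed by digits on both sides: the away team name
  away_team_name = str_squish(str_extract(rest, "(?<=\\d)\\D+(?=\\d)")),
  # the digit run that is followed only by non-digits: the home team id
  home_team_id   = str_extract(rest, "\\d+(?=\\D*$)"),
  # trailing non-digits: the home team name
  home_team_name = str_squish(str_extract(rest, "\\D+$")),
  stringsAsFactors = FALSE
)
print(full_names)
```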
You can use {unglue}:
unglue::unglue_unnest(
  df, game_info,
  "{date} {hour} {away_team_id=\\d+} {away_team_name} {home_team_id=\\d+} {home_team_name}", convert = TRUE)
#>    date     hour away_team_id    away_team_name home_team_id home_team_name
#> 1 12/16  7:00 PM          751 Appalachian State          752           Duke
#> 2 12/16  7:00 PM          753     Chicago State          754 Indiana-Purdue
#> 3 12/16  8:00 PM          755   Texas-Arlington          756   Oral Roberts
#> 4 12/16 10:00 PM          757         Dartmouth          758       Stanford
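Created on 2021-12-17 by the reprex package (v2.0.1)
To parse this properly we have to supply some regex information and let unglue "guess" the rest. Here it is enough to tell unglue that the IDs must be numeric. {away_team_name} is equivalent to {away_team_name=.*?}. convert = TRUE puts the IDs in numeric columns rather than text.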
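To see what that equivalence means in practice, the unglue pattern can be written out by hand as an ordinary regular expression in which every unconstrained field becomes a lazy `(.*?)` group. The following base R sketch is an illustration under that stated equivalence, not a description of how unglue works internally. `pattern` and `parts` are names introduced here.

```r
x <- c(
  "12/16 7:00 PM 751 Appalachian State 752 Duke",
  "12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
  "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
  "12/16 10:00 PM 757 Dartmouth 758 Stanford"
)

# Unconstrained fields are lazy groups; the two IDs are restricted to digits.
# The digit-only groups are what force "7:00 PM" to stay together: the hour
# group has to keep growing until the next token is all digits.
pattern <- "^(.*?) (.*?) (\\d+) (.*?) (\\d+) (.*?)$"

m <- regmatches(x, regexec(pattern, x, perl = TRUE))
parts <- do.call(rbind, m)[, -1, drop = FALSE]  # drop the full-match column
colnames(parts) <- c("date", "hour", "away_team_id", "away_team_name",
                     "home_team_id", "home_team_name")
parts <- as.data.frame(parts, stringsAsFactors = FALSE)

# Manual counterpart of convert = TRUE for the ID columns
parts$away_team_id <- as.integer(parts$away_team_id)
parts$home_team_id <- as.integer(parts$home_team_id)
print(parts)
```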