Splitting a Column According to the Natural Format of its Characters in R

I have the following data frame:

library(rvest)
library(XML)
library(tidyr)
library(zoo)
library(chron)
library(lubridate)
library(stringr)
page.201702050atl = read_html("http://www.pro-football-reference.com/boxscores/201702050atl.htm")
comments.201702050atl = page.201702050atl %>% html_nodes(xpath = "//comment()")
pbp.201702050atl = comments.201702050atl[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()
colnames(pbp.201702050atl) = c('Quarter', 'Time', 'Down', 'ToGo', 'Location', 'Detail', 'Away.Score', 'Home.Score', 'EPB', 'EPA', 'Win.pct')
pbp.201702050atl.a = pbp.201702050atl[-union(which(pbp.201702050atl$Quarter == '1st Quarter'), which(pbp.201702050atl$Quarter == 'Quarter')), ]
pbp.201702050atl.b = pbp.201702050atl.a[-union(which(pbp.201702050atl.a$Quarter == '2nd Quarter'), which(pbp.201702050atl.a$Quarter == '3rd Quarter')), ]
pbp.201702050atl.c = pbp.201702050atl.b[-union(which(pbp.201702050atl.b$Quarter == '4th Quarter'), which(pbp.201702050atl.b$Quarter == 'Overtime')), ]
pbp.201702050atl.d = pbp.201702050atl.c[-which(pbp.201702050atl.c$Quarter == 'End of Overtime'), ]

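I would like to create a new data frame that splits pbp.201702050atl.d$Location into two columns, so that the text elements form one column and the numeric elements form the other, like this: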

     V1    V2
1    "ATL" "35"
2    "NWE" "25"
3    "NWE" "34"
4    "NWE" "34"
5    "NWE" "34"
6    "NWE" "34"
7    "ATL" "34"
8    "ATL" "34"
9    "ATL" "34"
10   ""    "50"
...

To do this, I wrote:

Location.201702050atl = as.data.frame(str_split_fixed(as.character(pbp.201702050atl.d$Location), boundary("word"), n = 2))

This gets close to what I want, but the function produces:

     V1    V2
1    "ATL" "35"
2    "NWE" "25"
3    "NWE" "34"
4    "NWE" "34"
5    "NWE" "34"
6    "NWE" "34"
7    "ATL" "34"
8    "ATL" "34"
9    "ATL" "34"
10   "50"  ""
...

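Notice Location.201702050atl[10,]. This function only puts characters in Location.201702050atl$V2 if, for that row, the original value consists of two groups of characters separated by a space. Instead, I want text characters to go in Location.201702050atl$V1 and numeric characters to go in Location.201702050atl$V2. How can I split the elements of a column according to the natural format of their characters, given that the whole column has to be stored in a single format regardless of the natural format of its constituent characters? Any help is much appreciated, thank you.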

If I understood correctly, maybe this will help:

library(data.table)
DT <- data.table(C1=replicate(10, paste0(sample(99,1), paste0(sample(LETTERS,2), collapse = "")) ) )
# Simulating a white space
DT$C1[10] <- "84 ME"
DT
    C1
 1:  38XT
 2:  29XL
 3:  24XH
 4:  14SC
 5:  34SY
 6:  80WB
 7:  23VB
 8:  23WR
 9:  19KJ
10: 84 ME
# Backslashes must be doubled inside R string literals ("\\d", not "\d")
DT[, `:=` (C1_1 = gsub("[\\d]", "", C1, perl = TRUE), C1_2 = gsub("[^\\d]", "", C1, perl = TRUE)) ]
DT
       C1 C1_1 C1_2
 1:  38XT   XT   38
 2:  29XL   XL   29
 3:  24XH   XH   24
 4:  14SC   SC   14
 5:  34SY   SY   34
 6:  80WB   WB   80
 7:  23VB   VB   23
 8:  23WR   WR   23
 9:  19KJ   KJ   19
10: 84 ME   ME   84

If you need to remove the original column, you can do:

DT[, C1:=NULL]

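Note that this regex removes all digits to build the first column and all non-digits to build the second. It does not take order into account. For example, D7M8 returns DM and 78.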
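
As a rough sketch of how the same idea might carry over to values shaped like the Location column in the question, here is a base R version. The sample vector is made up for illustration (it is not the scraped data), and the trimws() call is an added assumption: removing digits from "ATL 35" leaves a trailing space behind, which you probably do not want in the text column.

```r
# Hypothetical sample values shaped like the Location column in the question,
# plus the D7M8 case from the note above
location <- c("ATL 35", "NWE 25", "50", "D7M8")

# Text part: drop every digit, then trim the whitespace left behind
team <- trimws(gsub("[[:digit:]]", "", location))

# Numeric part: drop everything that is not a digit
yard <- gsub("[^[:digit:]]", "", location)

# Combine into a two-column data frame; both columns stay character
result <- data.frame(V1 = team, V2 = yard, stringsAsFactors = FALSE)
print(result)
```

With this approach a value such as "50" ends up with an empty string in V1 and "50" in V2, which matches the layout asked for in the question. If you need V2 as numbers, as.integer(result$V2) should work, with the caveat that empty strings become NA.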