Splitting a Column According to the Natural Format of its Characters in R
I have the following data frame:
library(rvest)
library(XML)
library(tidyr)
library(zoo)
library(chron)
library(lubridate)
library(stringr)
page.201702050atl = read_html("http://www.pro-football-reference.com/boxscores/201702050atl.htm")
comments.201702050atl = page.201702050atl %>% html_nodes(xpath = "//comment()")
pbp.201702050atl = comments.201702050atl[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()
colnames(pbp.201702050atl) = c('Quarter', 'Time', 'Down', 'ToGo', 'Location', 'Detail', 'Away.Score', 'Home.Score', 'EPB', 'EPA', 'Win.pct')
pbp.201702050atl.a = pbp.201702050atl[-union(which(pbp.201702050atl$Quarter == '1st Quarter'), which(pbp.201702050atl$Quarter == 'Quarter')), ]
pbp.201702050atl.b = pbp.201702050atl.a[-union(which(pbp.201702050atl.a$Quarter == '2nd Quarter'), which(pbp.201702050atl.a$Quarter == '3rd Quarter')), ]
pbp.201702050atl.c = pbp.201702050atl.b[-union(which(pbp.201702050atl.b$Quarter == '4th Quarter'), which(pbp.201702050atl.b$Quarter == 'Overtime')), ]
pbp.201702050atl.d = pbp.201702050atl.c[-which(pbp.201702050atl.c$Quarter == 'End of Overtime'), ]
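I want to create a new data frame that splits pbp.201702050atl.d$Location into two columns, so that the alphabetic elements form one column and the numeric elements form the other, like this: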
V1 V2
1 "ATL" "35"
2 "NWE" "25"
3 "NWE" "34"
4 "NWE" "34"
5 "NWE" "34"
6 "NWE" "34"
7 "ATL" "34"
8 "ATL" "34"
9 "ATL" "34"
10 "" "50"
...
To do this, I wrote:
Location.201702050atl = as.data.frame(str_split_fixed(as.character(pbp.201702050atl.d$Location), boundary("word"), n = 2))
This gets close to what I want, but it produces:
V1 V2
1 "ATL" "35"
2 "NWE" "25"
3 "NWE" "34"
4 "NWE" "34"
5 "NWE" "34"
6 "NWE" "34"
7 "ATL" "34"
8 "ATL" "34"
9 "ATL" "34"
10 "50" ""
...
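Notice Location.201702050atl[10,]. The function only puts characters in Location.201702050atl$V2 if, for that row, the original value consists of two groups of characters separated by a space. Instead, I want like (text) characters to go in Location.201702050atl$V1 and like (numeric) characters to go in Location.201702050atl$V2. How can I split the elements of a column according to the natural format of their characters, given that the whole column has to be stored in a single format regardless of the natural format of its constituent characters? Any help is much appreciated, thank you.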
If I understood correctly, maybe this will help you:
library(data.table)
DT <- data.table(C1=replicate(10, paste0(sample(99,1), paste0(sample(LETTERS,2), collapse = "")) ) )
# Simulating a white space
DT$C1[10] <- "84 ME"
DT
C1
1: 38XT
2: 29XL
3: 24XH
4: 14SC
5: 34SY
6: 80WB
7: 23VB
8: 23WR
9: 19KJ
10: 84 ME
DT[, `:=` (C1_1 = gsub("[\\d]", "", C1, perl = TRUE), C1_2 = gsub("[^\\d]", "", C1, perl = TRUE)) ]
DT
C1 C1_1 C1_2
1: 38XT XT 38
2: 29XL XL 29
3: 24XH XH 24
4: 14SC SC 14
5: 34SY SY 34
6: 80WB WB 80
7: 23VB VB 23
8: 23WR WR 23
9: 19KJ KJ 19
10: 84 ME ME 84
If you need to remove the original column, you can do:
DT[, C1:=NULL]
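Note that this regex removes all digits for the first column and all non-digits for the second. It does not take order into account. For example, D7M8 returns DM and 78.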
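For what it's worth, the same idea can be tried directly on values shaped like the Location column from the question, using only base R. This is a minimal sketch, not a tested solution against the scraped table: the vector `loc` is made-up sample data standing in for pbp.201702050atl.d$Location, and it assumes every value is either a team abbreviation followed by a yard line or a bare number such as "50".

```r
# Hypothetical sample values in the style of pbp.201702050atl.d$Location
loc <- c("ATL 35", "NWE 25", "NWE 34", "50")

# Delete everything that is not a letter to get the text part,
# and everything that is not a digit to get the numeric part.
Location <- data.frame(
  V1 = gsub("[^A-Za-z]", "", loc),
  V2 = gsub("[^0-9]", "", loc),
  stringsAsFactors = FALSE
)

Location$V1  # "ATL" "NWE" "NWE" ""
Location$V2  # "35" "25" "34" "50"

# The ordering caveat still applies: interleaved input is collapsed
# into one run of letters and one run of digits.
gsub("[^A-Za-z]", "", "D7M8")  # "DM"
gsub("[^0-9]", "", "D7M8")     # "78"
```

Because each pattern deletes the unwanted character class instead of splitting on a boundary, a row like "50" ends up with an empty V1 and "50" in V2, which is the layout the question asks for.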