Splitting a character column by pattern matching
  Province                                          ElecDistName                   Candidate                               Votes Majority  Vper MajPer
  <chr>                                             <chr>                          <chr>                                   <int>    <int> <dbl>  <dbl>
1 Newfoundland and Labrador/Terre-Neuve-et-Labrador St. John's East/St. John's-Est Nick Whalen Liberal                     20974      646  46.7    1.4
2 Newfoundland and Labrador/Terre-Neuve-et-Labrador St. John's East/St. John's-Est Jack Harris ** NDP-New Democratic Party 20328       NA  45.3     NA
3 Newfoundland and Labrador/Terre-Neuve-et-Labrador St. John's East/St. John's-Est Deanne Stapleton Conservative            2938       NA   6.5     NA
4 Newfoundland and Labrador/Terre-Neuve-et-Labrador St. John's East/St. John's-Est David Anthony Peters Green Party          500       NA   1.1     NA
5 Newfoundland and Labrador/Terre-Neuve-et-Labrador St. John's East/St. John's-Est Sean Burton Communist                     140       NA   0.3     NA
6 New Brunswick/Nouveau-Brunswick                   Fundy Royal                    Alaina Lockhart Liberal                 19136     1775  40.9    3.8
Top of Dataset
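Amateur question: I'd like to split the Candidate column into two columns, one containing the name and the other the party. I've tried a few of the separate() calls posted here: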
separate(ElecResults, Candidate, into = c("Name", "Party"), sep = " (?=[^ ]+$)")
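But that seems to miss a lot of observations. The problem is obvious for candidates with three names, but there are others it seems to miss completely (such as the candidate with an inexplicable double asterisk).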
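To see why that pattern falls short, it may help to run it on a few of the Candidate values. The sketch below uses base R's strsplit with perl = TRUE as a stand-in for separate (an assumption made only to keep the example free of package dependencies). The regex is the one from the call above: it splits at the last space, so only the final word ends up in the second piece.

```r
# Stand-in for separate(): split each string at its last space,
# using the same regex as in the call above.
candidates <- c("Nick Whalen Liberal",
                "David Anthony Peters Green Party",
                "Jack Harris ** NDP-New Democratic Party")

pieces <- strsplit(candidates, " (?=[^ ]+$)", perl = TRUE)

# Bind the pieces into a two-column character matrix.
result <- do.call(rbind, pieces)
colnames(result) <- c("Name", "Party")
print(result)
```

Single-word parties such as Liberal split cleanly, but "Green Party" and "NDP-New Democratic Party" leave only "Party" in the second column. That suggests the split needs to be driven by the party names rather than by whitespace.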
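I've also considered combining if statements with grepl to identify the most common party names, such as Liberal, Conservative, NDP, and Green, and to create a new column called Party containing the party name, but I keep getting error messages with every attempt.
If anyone knows how I could split this column, it would be a huge help.
Thanks!
Here is the data, as output by dput: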
structure(list(Province = c("Newfoundland and Labrador/Terre-Neuve-et-Labrador",
"Newfoundland and Labrador/Terre-Neuve-et-Labrador", "Newfoundland and Labrador/Terre-Neuve-et-Labrador",
"Newfoundland and Labrador/Terre-Neuve-et-Labrador", "Newfoundland and Labrador/Terre-Neuve-et-Labrador",
"New Brunswick/Nouveau-Brunswick"), ElecDistName = c("St. John's East/St. John's-Est",
"St. John's East/St. John's-Est", "St. John's East/St. John's-Est",
"St. John's East/St. John's-Est", "St. John's East/St. John's-Est",
"Fundy Royal"), Candidate = c("Nick Whalen Liberal", "Jack Harris ** NDP-New Democratic Party",
"Deanne Stapleton Conservative", "David Anthony Peters Green Party",
"Sean Burton Communist", "Alaina Lockhart Liberal"), Votes = c(20974L,
20328L, 2938L, 500L, 140L, 19136L), Majority = c(646L, NA, NA,
NA, NA, 1775L), Vper = c(46.7, 45.3, 6.5, 1.1, 0.3, 40.9), MajPer = c(1.4,
NA, NA, NA, NA, 3.8)), .Names = c("Province", "ElecDistName",
"Candidate", "Votes", "Majority", "Vper", "MajPer"), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Here is some basic code that you will need to modify. Put the party names inside the quotes, separated by |.
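library(dplyr)
library(stringr)
df <- data.frame(Candidate = "Nick Whalen Liberal", Majority = 1)
# One regex alternation; add the remaining party names, separated by |
parties <- "Liberal|Conservative"
# Use the "start" column so that every row gets its own match position
df %>% mutate(Name = str_trim(str_sub(Candidate, 1, str_locate(Candidate, parties)[, "start"] - 1)))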
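The same idea can be extended to fill both a Name and a Party column for all six sample rows. The following is a dependency-free base R sketch rather than the answer's stringr code. The party list is an assumption taken from the six rows in the question, and stripping the trailing "**" from the name is also an assumption about what is wanted.

```r
# Candidate strings from the question's dput() output.
candidates <- c("Nick Whalen Liberal",
                "Jack Harris ** NDP-New Democratic Party",
                "Deanne Stapleton Conservative",
                "David Anthony Peters Green Party",
                "Sean Burton Communist",
                "Alaina Lockhart Liberal")

# Alternation of party names, anchored at the end of the string.
# Assumption: these five parties cover the data; extend as needed.
parties <- c("Liberal", "NDP-New Democratic Party", "Conservative",
             "Green Party", "Communist")
pattern <- paste0("(", paste(parties, collapse = "|"), ")$")

# Start position of the party in each string (-1 where nothing matches).
start <- as.integer(regexpr(pattern, candidates))

split_df <- data.frame(
  Name  = ifelse(start > 0, trimws(substr(candidates, 1, start - 1)), candidates),
  Party = ifelse(start > 0, substring(candidates, start), NA_character_),
  stringsAsFactors = FALSE
)

# Strip the trailing "**" marker (assumed unwanted in the Name column).
split_df$Name <- trimws(sub("\\*+$", "", split_df$Name))
print(split_df)
```

Rows whose party is not in the list keep the full string in Name and get NA in Party, which makes them easy to spot and add to the pattern.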
Here is another approach, using the fuzzyjoin package:
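library(tidyverse)
library(fuzzyjoin)
# df is assumed to be the data frame from the question's dput() output
parties <- tibble(party = c("Liberal", "NDP-New Democratic Party", "Conservative", "Green Party", "Communist"))
df %>%
  # each party name is used as a regex and matched against Candidate
  regex_left_join(parties, by = c(Candidate = "party")) %>%
  # candidates with no matching party get the label "minor"
  replace_na(list(party = "minor")) %>%
  mutate(Candidate = str_replace(Candidate, party, "")) %>%
  select(Candidate, party)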
#> # A tibble: 6 x 2
#>   Candidate             party
#>   <chr>                 <chr>
#> 1 Nick Whalen           Liberal
#> 2 Jack Harris **        NDP-New Democratic Party
#> 3 Deanne Stapleton      Conservative
#> 4 David Anthony Peters  Green Party
#> 5 Sean Burton           Communist
#> 6 Alaina Lockhart       Liberal
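Note that the final select is only added to show that the approach works. I particularly like this approach because replace_na neatly handles any other parties that might appear in the data frame.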
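The fallback behaviour can be illustrated without fuzzyjoin. The base R sketch below is not the answer's code, only an approximation of the same logic. The row "Jane Doe Independent" is invented for illustration and does not come from the data set, and find_party is a hypothetical helper name.

```r
# Known parties, as listed in the answer above.
parties <- c("Liberal", "NDP-New Democratic Party", "Conservative",
             "Green Party", "Communist")

# Two rows from the question plus one invented row with an unlisted party.
candidates <- c("Nick Whalen Liberal",
                "David Anthony Peters Green Party",
                "Jane Doe Independent")  # hypothetical example value

# Hypothetical helper: first listed party contained in the string, else NA.
find_party <- function(x) {
  hit <- parties[vapply(parties, grepl, logical(1), x = x, fixed = TRUE)]
  if (length(hit) > 0) hit[1] else NA_character_
}
party <- vapply(candidates, find_party, character(1), USE.NAMES = FALSE)

# Remove the matched party from the string; unmatched rows stay as they are.
name <- candidates
matched <- !is.na(party)
name[matched] <- trimws(mapply(function(s, p) sub(p, "", s, fixed = TRUE),
                               candidates[matched], party[matched],
                               USE.NAMES = FALSE))

# Fallback label for parties that are not in the list.
party[!matched] <- "minor"

result <- data.frame(Candidate = name, party = party, stringsAsFactors = FALSE)
print(result)
```

As in the fuzzyjoin version, an unmatched row keeps its full Candidate string, so the "minor" rows can be reviewed later and their parties added to the list.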