How to split a string after the nth character in R
I am working with the following data:
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
I would like to split each string after the second character and put the pieces into two columns,
so that the data looks like this:
state district
AR 01
AZ 03
AZ 05
AZ 08
CA 01
CA 05
CA 11
CA 16
CA 18
CA 21
Is there simple code to get this done? Thanks a lot for your help.
If you always want to split after the second character, you can use substr.
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
# characters 1 through 2 are the state
state <- substr(District, 1, 2)
# characters 3 through 4 are the district
district <- substr(District, 3, 4)
# put the pieces in a data frame if needed
st_dt <- data.frame(state = state, district = district, stringsAsFactors = FALSE)
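The same idea can be generalized to an arbitrary split position. The following is only a minimal sketch: split_after_n and the column names left and right are hypothetical names chosen for illustration, not part of the original answer. It relies on substring(), whose last argument defaults to the end of the string.

```r
# Hypothetical helper: split each string after its n-th character
# and return the two pieces as columns of a data frame.
split_after_n <- function(x, n) {
  data.frame(
    left  = substr(x, 1, n),       # characters 1..n
    right = substring(x, n + 1),   # character n+1 to the end of the string
    stringsAsFactors = FALSE
  )
}

District <- c("AR01", "AZ03", "CA11")
res <- split_after_n(District, 2)
print(res)
```

Because substring() runs to the end of each string, the right-hand piece does not have to be the same length in every row.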
You can use strcapture from base R:
strcapture("(\\w{2})(\\w{2})", District,
           data.frame(state = character(), District = character()))
state District
1 AR 01
2 AZ 03
3 AZ 05
4 AZ 08
5 CA 01
6 CA 05
7 CA 11
8 CA 16
9 CA 18
10 CA 21
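where \w{2} matches two word characters (letters, digits, or underscore). In an R string literal the backslash must be doubled, i.e. written as "\\w".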
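If the split position is not always 2, the pattern can be built at run time. This is a sketch under assumptions: split_with_strcapture is a hypothetical wrapper, and the pattern uses `.` (any character) instead of \w so that it splits purely by position.

```r
# Hypothetical wrapper: capture the first n characters and the remainder.
split_with_strcapture <- function(x, n) {
  pattern <- sprintf("^(.{%d})(.*)$", n)   # n characters, then the rest
  strcapture(pattern, x,
             data.frame(state = character(), district = character()))
}

District <- c("AR01", "AZ03", "CA11")
out <- split_with_strcapture(District, 2)
print(out)
```

Strings that do not match the pattern (for example, ones shorter than n characters) should come back as NA in both columns.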
We can use str_match to capture the first two characters and the rest of the string in separate columns.
stringr::str_match(District, "(..)(.*)")[, -1]
# [,1] [,2]
# [1,] "AR" "01"
# [2,] "AZ" "03"
# [3,] "AZ" "05"
# [4,] "AZ" "08"
# [5,] "CA" "01"
# [6,] "CA" "05"
# [7,] "CA" "11"
# [8,] "CA" "16"
# [9,] "CA" "18"
#[10,] "CA" "21"
The OP wrote:
"I'm more familiar with strsplit(). But since there is nothing to split on, it's not applicable in this case."
Au contraire! There is something to split on, and it is called a lookbehind:
strsplit(District, "(?<=[A-Z]{2})", perl = TRUE)
The lookbehind works like "inserting an invisible break" after 2 capital letters, and the string is split there.
The result is a list of vectors:
[[1]]
[1] "AR" "01"
[[2]]
[1] "AZ" "03"
[[3]]
[1] "AZ" "05"
[[4]]
[1] "AZ" "08"
[[5]]
[1] "CA" "01"
[[6]]
[1] "CA" "05"
[[7]]
[1] "CA" "11"
[[8]]
[1] "CA" "16"
[[9]]
[1] "CA" "18"
[[10]]
[1] "CA" "21"
It can be turned into a matrix, e.g.,
do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))
[,1] [,2]
[1,] "AR" "01"
[2,] "AZ" "03"
[3,] "AZ" "05"
[4,] "AZ" "08"
[5,] "CA" "01"
[6,] "CA" "05"
[7,] "CA" "11"
[8,] "CA" "16"
[9,] "CA" "18"
[10,] "CA" "21"
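To get from that matrix to the two named columns the question asks for, one possible follow-up step is shown below. It is a sketch only; the names parts and df are introduced here for illustration.

```r
District <- c("AR01", "AZ03", "CA11")

# Split after the two leading capital letters, then bind the pieces row-wise.
parts <- do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))

# Name the columns and convert the character matrix to a data frame.
colnames(parts) <- c("state", "district")
df <- as.data.frame(parts, stringsAsFactors = FALSE)
print(df)
```

Note that rbind only gives a clean matrix when every string splits into the same number of pieces, which holds for the data in the question.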
Treat it as a fixed-width file and import it:
# read fixed width file
read.fwf(textConnection(District), widths = c(2, 2), colClasses = "character")
# V1 V2
# 1 AR 01
# 2 AZ 03
# 3 AZ 05
# 4 AZ 08
# 5 CA 01
# 6 CA 05
# 7 CA 11
# 8 CA 16
# 9 CA 18
# 10 CA 21
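The default column names V1 and V2 can be replaced at read time through the col.names argument of read.fwf. A minimal sketch, with con and df as illustrative names:

```r
District <- c("AR01", "AZ03", "CA11")

# Read the vector as a fixed-width "file" with two 2-character fields.
con <- textConnection(District)
df <- read.fwf(con, widths = c(2, 2),
               col.names = c("state", "district"),
               colClasses = "character")   # keep "01" as text, not the number 1
close(con)                                 # release the text connection
print(df)
```

Keeping colClasses = "character" matters here because the district codes carry leading zeros.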
With the tidyverse, this is very easy using the separate function from tidyr:
library(tidyverse)
District %>%
as_tibble() %>%
separate(value, c("state", "district"), sep = "(?<=[A-Z]{2})")
# A tibble: 10 × 2
state district
<chr> <chr>
1 AR 01
2 AZ 03
3 AZ 05
4 AZ 08
5 CA 01
6 CA 05
7 CA 11
8 CA 16
9 CA 18
10 CA 21
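Since the question is about splitting at a fixed position, it may be worth noting that separate also accepts a numeric sep, which tidyr interprets as a character position rather than a regular expression. The sketch below assumes the tidyr package is installed; the column name value and the object out are chosen for illustration.

```r
library(tidyr)

District <- c("AR01", "AZ03", "CA11")

# A numeric sep splits after that character position (counted from the left).
out <- separate(data.frame(value = District, stringsAsFactors = FALSE),
                value, into = c("state", "district"), sep = 2)
print(out)
```

This avoids the regular expression entirely and does not depend on the first two characters being capital letters.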