Splitting a string in a data frame in R using multiple literal delimiters
I have a single-column data frame of addresses that looks like this:
ADDRESS
123 Main Street Unit A
456 Main Street Apt 3
789 Main Street Floor 2
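I would like to parse the addresses to separate the Unit/Apt/Floor information from the rest of the street address. Is there a simple way to do this, given that I know up front that the delimiters should be "Unit", "Apt", and "Floor"?
The desired end result would be a two-column data frame that looks like this: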
ADDRESS          UNIT
123 Main Street  Unit A
456 Main Street  Apt 3
789 Main Street  Floor 2
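I tried using separate from the tidyr package, but as far as I can tell it only accepts a single delimiter argument. The task could therefore be done with multiple calls to separate, but that seems silly.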
df <- df %>% tidyr::separate(ADDRESS, into = c("ADDRESS","UNIT"), sep = ' Apt')
# This would need to be repeated using ' Unit' and ' Floor'.
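Similarly, it seems that stringr::str_split_fixed() should be able to handle this task, but again I cannot figure out how to do it in a single call (i.e., specifying all three delimiters at once).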
stringr::str_split_fixed(df$Address, c(' Unit', ' Apt', ' Floor'), 2)
# Does not work! Additionally does not result in additional column in dataframe as desired.
Here is the code to create the sample data frame:
library(dplyr) # for piping
library(tidyr)
library(stringr)
df <- data.frame(ADDRESS = c("123 Main Street Unit A", "456 Main Street Apt 3", "789 Main Street Floor 2"))
Does this help?
Using base R:
# Backslashes must be doubled inside R string literals.
gsub('(\\d+\\sMain Street\\s)(.*)', '\\2', df$ADDRESS)
[1] "Unit A" "Apt 3" "Floor 2"
Using dplyr and stringr:
library(dplyr)
library(stringr)
df %>% mutate(UNIT = str_extract(ADDRESS, '(?<=Main Street ).*'))
                  ADDRESS    UNIT
1  123 Main Street Unit A  Unit A
2   456 Main Street Apt 3   Apt 3
3 789 Main Street Floor 2 Floor 2
Using tidyr::separate, you can split on a whitespace character that is followed by any one of the three delimiters:
library(tidyr)
df <- data.frame(ADDRESS = c("123 Main Street Unit A", "456 Main Street Apt 3", "789 Main Street Floor 2"))
df %>%
  separate(ADDRESS, sep = "\\s(?=Unit|Apt|Floor)", into = c("address", "unit"))
#>           address    unit
#> 1 123 Main Street  Unit A
#> 2 456 Main Street   Apt 3
#> 3 789 Main Street Floor 2
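Since the question also asks about stringr::str_split_fixed(), the same lookahead idea should carry over there. This is only a minimal sketch, assuming stringr is installed; the names delims, pattern, and parts are illustrative and do not come from the posts above. The pattern is a regular expression built from the three delimiter words, so they are all handled in one call:

```r
library(stringr)

df <- data.frame(
  ADDRESS = c("123 Main Street Unit A", "456 Main Street Apt 3",
              "789 Main Street Floor 2"),
  stringsAsFactors = FALSE
)

# Build one regex from the delimiter words: whitespace that is followed
# by "Unit", "Apt", or "Floor" as a whole word. The lookahead (?=...)
# keeps the delimiter word itself in the second piece.
delims  <- c("Unit", "Apt", "Floor")
pattern <- paste0("\\s+(?=(", paste(delims, collapse = "|"), ")\\b)")

# Split each address into at most two pieces; the result is a character matrix.
parts <- str_split_fixed(df$ADDRESS, pattern, n = 2)

df$ADDRESS <- parts[, 1]
df$UNIT    <- parts[, 2]
df
```

Note that str_split_fixed() returns a matrix rather than a data frame, which is why the two columns are assigned back by hand here.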
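This can also be done in base R:
# Backslashes must be doubled inside R string literals; \\K discards the text matched so far.
df$UNIT <- trimws(regmatches(df$ADDRESS, regexpr("\\d+\\s+Main\\s+Street\\K(.*)", df$ADDRESS, perl = TRUE)))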
                  ADDRESS    UNIT
1  123 Main Street Unit A  Unit A
2   456 Main Street Apt 3   Apt 3
3 789 Main Street Floor 2 Floor 2
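The base R patterns above hard-code "Main Street", so they would probably not carry over to other street names. A possible variation, offered only as a sketch, is to locate the delimiter words themselves and cut each string at that position. The fourth row ("10 Oak Avenue") is a hypothetical address with no unit part, added to show how a missing delimiter could be handled; the names pattern, pos, start, len, and has_unit are illustrative:

```r
df <- data.frame(
  ADDRESS = c("123 Main Street Unit A", "456 Main Street Apt 3",
              "789 Main Street Floor 2", "10 Oak Avenue"),  # last row is hypothetical
  stringsAsFactors = FALSE
)

# Whitespace followed by one of the delimiter words as a whole word.
pattern <- "\\s+(?=(Unit|Apt|Floor)\\b)"

# regexpr() gives the start of the first match per string (-1 if none)
# and the match length as an attribute.
pos      <- regexpr(pattern, df$ADDRESS, perl = TRUE)
start    <- as.integer(pos)
len      <- attr(pos, "match.length")
has_unit <- start > 0

# Everything after the matched whitespace is the unit; NA when no delimiter is found.
df$UNIT <- ifelse(has_unit, substring(df$ADDRESS, start + len), NA_character_)

# Everything before the matched whitespace is the street address.
df$ADDRESS <- ifelse(has_unit, substring(df$ADDRESS, 1, start - 1), df$ADDRESS)
df
```

Rows without any of the three words keep their full address and get NA in UNIT, rather than being dropped as they would be by regmatches().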