Split String at First Occurrence of an Integer using R
Note: I have already read Split string at first occurrence of an integer in a string, but my requirement is different because I want to use R.
Suppose I have the following sample data frame:
> df = data.frame(name_and_address =
c("Mr. Smith12 Some street",
"Mr. Jones345 Another street",
"Mr. Anderson6 A different street"))
> df
name_and_address
1 Mr. Smith12 Some street
2 Mr. Jones345 Another street
3 Mr. Anderson6 A different street
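I want to split each string at the first occurrence of an integer. Note that the integers vary in length.
The desired output could look like this: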
[[1]]
[1] "Mr. Smith"
[2] "12 Some street",
[[2]]
[1] "Mr. Jones"
[2] "345 Another street",
[[3]]
[1] "Mr. Anderson"
[2] "6 A different street"
I tried the following, but I cannot get the regular expression right:
# Attempt 1 (Does not work)
library(data.table)
tstrsplit(df,'(?=\\d+)', perl=TRUE, type.convert=TRUE)
# Attempt 2 (Does not work)
library(stringr)
str_split(df, "\\d+")
You can use tidyr::extract:
library(tidyr)
df <- df %>%
  extract("name_and_address", c("name", "address"), "(\\D*)(\\d.*)")
## => df
## name address
## 1 Mr. Smith 12 Some street
## 2 Mr. Jones 345 Another street
## 3 Mr. Anderson 6 A different street
The (\D*)(\d.*) regex matches the following:
(\D*) - Group 1: any zero or more non-digit characters
(\d.*) - Group 2: a digit, followed by any zero or more characters, as many as possible.
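To see the two capture groups on their own, here is a minimal base R sketch using regexec and regmatches. It is only an illustration, not part of the tidyr solution; the variable names x, m and parts are made up for the example.

```r
# Sample strings from the question
x <- c("Mr. Smith12 Some street",
       "Mr. Jones345 Another street",
       "Mr. Anderson6 A different street")

# regexec() records the full match plus each capture group;
# regmatches() extracts them as one character vector per input string
m <- regmatches(x, regexec("(\\D*)(\\d.*)", x))

# Stack into a matrix: column 1 = full match, column 2 = group 1, column 3 = group 2
parts <- do.call(rbind, m)

parts[, 2]  # group 1: the non-digit prefix
parts[, 3]  # group 2: from the first digit to the end
```

Note that the backslashes are doubled inside the R string literal, while the regex itself contains single backslashes.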
Another solution with stringr::str_split is also possible:
str_split(df$name_and_address, "(?=\\d)", n=2)
## => [[1]]
## [1] "Mr. Smith" "12 Some street"
## [[2]]
## [1] "Mr. Jones" "345 Another street"
## [[3]]
## [1] "Mr. Anderson" "6 A different street"
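(?=\d) is a positive lookahead that matches the position right before a digit, and n=2 tells stringr::str_split to split into at most 2 chunks.
A base R approach that returns nothing (an empty string) if there is no digit in the string: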
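The same "position right before the first digit" idea can also be sketched in base R without a lookahead, by locating the first digit with regexpr and cutting with substr. This is a different technique from the str_split call above, shown only for illustration, and it assumes every string contains at least one digit (regexpr returns -1 otherwise). The names x, pos, name, address and result are made up for the example.

```r
# Sample strings from the question
x <- c("Mr. Smith12 Some street",
       "Mr. Jones345 Another street",
       "Mr. Anderson6 A different street")

# 1-based position of the first digit in each string (-1 if there is none)
pos <- as.integer(regexpr("[0-9]", x))

# Everything before that position, and everything from it onwards
name    <- substr(x, 1, pos - 1)
address <- substr(x, pos, nchar(x))

# Assemble a list of two-element vectors, shaped like the desired output
result <- lapply(seq_along(x), function(i) c(name[i], address[i]))
```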
df = data.frame(name_and_address = c("Mr. Smith12 Some street", "Mr. Jones345 Another street", "Mr. Anderson6 A different street", "1 digit is at the start", "No digits, sorry."))
df$name <- sub("^(?:(\\D*)\\d.*|.+)", "\\1", df$name_and_address)
df$address <- sub("^\\D*(\\d.*)?", "\\1", df$name_and_address)
df$name
# => [1] "Mr. Smith" "Mr. Jones" "Mr. Anderson" "" ""
df$address
# => [1] "12 Some street" "345 Another street"
# [3] "6 A different street" "1 digit is at the start" ""
See an online R demo. This also supports cases where the first digit is the very first character of the string.
I would use sub here:
df$name <- sub("(\\D+).*", "\\1", df$name_and_address)
df$address <- sub(".*?(\\d+.*)", "\\1", df$name_and_address)
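One thing to keep in mind with plain sub calls like these: sub returns its input unchanged when the pattern does not match, so a string without any digit would come back whole in the address column. A minimal sketch of one way to guard against that, assuming you want NA for such rows (that choice, and the names x, has_digit, name and address, are assumptions for the example):

```r
# One string with a digit, one without
x <- c("Mr. Smith12 Some street", "No digits, sorry.")

# TRUE where the string contains at least one digit
has_digit <- grepl("\\d", x)

# Apply the two sub() calls only where a digit exists; use NA otherwise
name    <- ifelse(has_digit, sub("(\\D+).*", "\\1", x), NA_character_)
address <- ifelse(has_digit, sub(".*?(\\d+.*)", "\\1", x), NA_character_)
```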