How to delete text that does not start with a certain number of digits in a dataframe column
I have this:
col1
1234HO
9535KU
4532SP
1
hello
xyz
1206
9530OK
23
8524US
I need it to look like this:
col1 col2 col3
1234HO 1234 HO
9535KU 9535 KU
4532SP 4532 SP
#these rows still need to be there
1206 1206 #keep in mind that I still want to keep this if there is 4 numbers
9530OK 9530 OK
8524US 8524 US
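I tried deleting them by hand, but that is too much work.
I am not sure how to write a function that says "delete all text that does not start with 4 digits". I would only know how to do it if the digits were always the same, but they can be any digits.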
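You can use tidyr::separate followed by filter. separate splits each value at the boundary between a digit and a letter, and filter keeps only the rows whose numeric part contains no letters and is at least four characters long.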
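library(dplyr)
library(tidyr)
# dat is the data frame from the question, with a single column col1
dat %>%
  # split where a digit is immediately followed by a letter
  separate(col1, into = c("num", "text"), sep = "(?<=[0-9])(?=[A-Za-z])", remove = FALSE) %>%
  # keep rows whose first part has no letters and is at least 4 characters long
  filter(!grepl("[A-Za-z]", num) & nchar(num) > 3)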
col1 num text
1 1234HO 1234 HO
2 9535KU 9535 KU
3 4532SP 4532 SP
4 1206 1206 <NA>
5 9530OK 9530 OK
6 8524US 8524 US
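For the sample data, the two-part filter condition selects the same rows as a single anchored regex test on the original column. The sketch below uses base R only and assumes the values are stored as character strings; the variable names are illustrative and not part of the answer above.

```r
# Sample values copied from the question
col1 <- c("1234HO", "9535KU", "4532SP", "1", "hello", "xyz",
          "1206", "9530OK", "23", "8524US")

# TRUE where the value begins with (at least) four digits
starts_with_4_digits <- grepl("^[0-9]{4}", col1)

# Values that would survive the filter
kept <- col1[starts_with_4_digits]
kept
# [1] "1234HO" "9535KU" "4532SP" "1206"   "9530OK" "8524US"
```

The `^` anchor matters here: without it, a value such as "ab1234" would also count as a match.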
Here is a dplyr approach that needs only very basic regular expressions.
Input
# your df
df <- read.table(header = TRUE, text = "
col1
1234HO
9535KU
4532SP
1
hello
xyz
1206
9530OK
23
8524US")
Rows left empty
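library(dplyr)
library(stringr)  # str_extract
library(tidyr)    # replace_na
df %>% mutate(col2 = str_extract(col1, "^[0-9]{4,}"),  # leading run of 4 or more digits
              col3 = str_extract(col1, "[A-Z].*$"),    # from the first capital letter to the end
              col3 = replace_na(col3, ""),
              # blank out every column in rows that do not start with 4 digits
              across(everything(), ~ifelse(grepl("^[0-9]{4}", col1), .x, "")))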
col1 col2 col3
1 1234HO 1234 HO
2 9535KU 9535 KU
3 4532SP 4532 SP
4
5
6
7 1206 1206
8 9530OK 9530 OK
9
10 8524US 8524 US
Rows as NA
# if you want them to be filled with NA
df %>% mutate(col2 = str_extract(col1, "^[0-9]{4,}"),
              col3 = str_extract(col1, "[A-Z].*$"),
              across(everything(), ~ifelse(grepl("^[0-9]{4}", col1), .x, NA)))
col1 col2 col3
1 1234HO 1234 HO
2 9535KU 9535 KU
3 4532SP 4532 SP
4 <NA> <NA> <NA>
5 <NA> <NA> <NA>
6 <NA> <NA> <NA>
7 1206 1206 <NA>
8 9530OK 9530 OK
9 <NA> <NA> <NA>
10 8524US 8524 US
Another possible solution:
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
col1 = c("1234HO","9535KU",
"4532SP","1","hello","xyz","1206","9530OK","23",
"8524US")
)
df %>%
  separate(col1, into = str_c("col", 2:3), sep = "(?<=\\d{4})",
           remove = FALSE, fill = "right") %>%
  filter(!is.na(col3))
#> col1 col2 col3
#> 1 1234HO 1234 HO
#> 2 9535KU 9535 KU
#> 3 4532SP 4532 SP
#> 4 1206 1206
#> 5 9530OK 9530 OK
#> 6 8524US 8524 US
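Since the question asks for a function and for "a certain number" of digits rather than exactly four, the same idea can be wrapped in a small helper. This is only a sketch in base R: `split_leading_digits` and its parameter `n` are hypothetical names introduced here, and it assumes that "starts with n digits" means "starts with at least n digits". Unlike a bare lookbehind, the pattern is anchored with `^`, so digits in the middle of a string do not count.

```r
# Hypothetical helper: keep strings that start with at least n digits and
# split them into the leading digits (col2) and the remainder (col3)
split_leading_digits <- function(x, n = 4) {
  # e.g. n = 4 gives "^([0-9]{4,})(.*)$"
  pattern <- sprintf("^([0-9]{%d,})(.*)$", n)
  ok <- grepl(pattern, x)
  data.frame(
    col1 = x[ok],
    col2 = sub(pattern, "\\1", x[ok]),  # leading run of digits
    col3 = sub(pattern, "\\2", x[ok]),  # everything after the digits
    stringsAsFactors = FALSE
  )
}

out <- split_leading_digits(c("1234HO", "ab1234cd", "23", "1206"))
out
```

Here "ab1234cd" and "23" are dropped, "1234HO" is split into "1234" and "HO", and "1206" is kept with an empty col3. Calling it as `split_leading_digits(df$col1)` should apply it to the data frame from the question.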