How to identify rows that contain non-letter characters in R using tidyverse/regex
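I have a data frame containing strings that represent a 'Full Name'. Some are complete, normal full names; others are not 'complete' or 'accurate' because they contain non-letter characters.
Example data frame: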
Full name
----------
Mikki Clancy
Hermsdorfer, Mark (retired)
CSP, PSECU Lan Unit (typo)
Clifton Gurlen
G�mez, Oscar Prieto
Sj�¶strand, Anders
Lisa Terry
Meloy, Wilson {old}
Gregory Stevens
Charles Gruenberg
df <- structure(list(Full_name = c("Jane Clancy",
"Hermsdorfer, Mark (retired)",
"CSP, PSECU Lan Unit (typo)",
"Clif Gurlen",
"G�mez, Oscar Prieto",
"Sj�¶strand, Anders",
"Liza Terry",
"Meloy, Will {old}",
"Garret Stevens",
"Charly Ruenberg"), Group = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")), class = "data.frame", row.names = c(NA, -10L))
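The problem is to subset the full data frame by the strings that contain non-ASCII characters (for example, from the values above: '{}', '()', '&', '�').
The desired output is a column of the names that contain these characters, followed by the total row count, so that I can calculate the percentage of names in the full data frame that are 'not complete' or 'accurate'.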
Not Complete Full name
----------------------
Hermsdorfer, Mark (retired)
CSP, PSECU Lan Unit (typo)
G�mez, Oscar Prieto
Sj�¶strand, Anders
Meloy, Wilson {old}
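We can use str_detect with a negated character class, which flags any name containing a character other than an ASCII letter, a comma, or a space: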
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(Full_name, "[^A-Za-z, ]+"))
Full_name Group
1 Hermsdorfer, Mark (retired) b
2 CSP, PSECU Lan Unit (typo) c
3 G�mez, Oscar Prieto e
4 Sj�¶strand, Anders f
5 Meloy, Will {old} h
For a more comprehensive notion of what counts as a letter, I borrowed the regex from this question about matching letters.
library(dplyr)
# \p{L} matches any Unicode letter; the backslash must be doubled inside an R string
df %>% mutate(
  has_non_letters = grepl("[^\\p{L} ]", names, perl = TRUE)
)
# names has_non_letters
# 1 Mikki Clancy FALSE
# 2 Hermsdorfer, Mark (retired) TRUE
# 3 CSP, PSECU Lan Unit (typo) TRUE
# 4 Clifton Gurlen FALSE
# 5 G<U+FFFD>mez, Oscar Prieto TRUE
# 6 Sj�¶strand, Anders TRUE
# 7 Lisa Terry FALSE
# 8 Meloy, Wilson {old} TRUE
# 9 Gregory Stevens FALSE
# 10 Charles Gruenberg FALSE
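To make the difference between the two character classes concrete, here is a minimal sketch in base R. The sample vector `x` and the names `ascii_flag` and `unicode_flag` are made up for illustration, and the comma is added to the Unicode-aware class as an assumption (the pattern above would also flag commas). It also assumes that R's PCRE build has Unicode property support, which `\p{L}` needs.

```r
# Hypothetical sample: a correctly encoded accented name, a name containing
# the Unicode replacement character, a plain ASCII name, and one with parentheses.
x <- c("G\u00f3mez, Oscar", "G\ufffdmez, Oscar", "Lisa Terry", "Mark (retired)")

# ASCII-only class: flags anything other than A-Z, a-z, comma, or space.
ascii_flag <- grepl("[^A-Za-z, ]", x, perl = TRUE)

# Unicode-aware class: \p{L} accepts letters from any script,
# so accented letters are not flagged.
unicode_flag <- grepl("[^\\p{L}, ]", x, perl = TRUE)

# Side-by-side comparison of the two flags.
comparison <- data.frame(name = x, ascii_flag, unicode_flag)
print(comparison)
```

Under those assumptions, the correctly accented name should be flagged only by the ASCII-only class, while the replacement character and the parentheses are flagged by both. Which behavior is right depends on whether accented letters count as valid in your data.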
I'll leave the extra summarizing to you: you can sum or mean the TRUE/FALSE values, whichever you prefer.
Using this data:
df = data.frame(names = c(
"Mikki Clancy",
"Hermsdorfer, Mark (retired)",
"CSP, PSECU Lan Unit (typo)",
"Clifton Gurlen",
"G�mez, Oscar Prieto",
"Sj�¶strand, Anders",
"Lisa Terry",
"Meloy, Wilson {old}",
"Gregory Stevens",
"Charles Gruenberg"
))
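The summary step mentioned above can be sketched as follows in base R. This is only an illustration: `names_vec`, `n_flagged`, and `pct_flagged` are hypothetical names, and the sample keeps just the ASCII-only names from the question so the result does not depend on file encoding.

```r
# ASCII-only subset of the names from the question (hypothetical variable name).
names_vec <- c(
  "Mikki Clancy",
  "Hermsdorfer, Mark (retired)",
  "CSP, PSECU Lan Unit (typo)",
  "Clifton Gurlen",
  "Lisa Terry",
  "Meloy, Wilson {old}",
  "Gregory Stevens",
  "Charles Gruenberg"
)

# TRUE where a name contains anything other than a letter or a space.
has_non_letters <- grepl("[^\\p{L} ]", names_vec, perl = TRUE)

# sum() counts the TRUE values; mean() gives their proportion.
n_flagged   <- sum(has_non_letters)
n_total     <- length(has_non_letters)
pct_flagged <- 100 * mean(has_non_letters)

# The flagged names themselves, as in the desired output.
flagged_names <- names_vec[has_non_letters]
```

Because logical values are coerced to 0 and 1, `sum()` and `mean()` give the count and the share of flagged rows without any extra bookkeeping.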