How to identify rows that contain non-letter characters in R using tidyverse/regex

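I have a data frame containing strings that represent a 'Full Name'. Some are complete, ordinary full names; others are not 'complete' or 'accurate' because they contain non-letter characters.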

Example data frame:

Full name
----------

Mikki Clancy
Hermsdorfer, Mark (retired)
CSP, PSECU Lan Unit (typo)
Clifton Gurlen
G�mez, Oscar Prieto
Sj�¶strand, Anders
Lisa Terry
Meloy, Wilson {old}
Gregory Stevens
Charles Gruenberg

df <- structure(list(Full_name = c("Jane Clancy",
                                       "Hermsdorfer, Mark (retired)",
                                       "CSP, PSECU Lan Unit (typo)",
                                       "Clif Gurlen",
                                       "G�mez, Oscar Prieto",
                                       "Sj�¶strand, Anders",
                                       "Liza Terry",
                                       "Meloy, Will {old}",
                                       "Garret Stevens",
                                       "Charly Ruenberg"),
                          Group = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")),
                     class = "data.frame", row.names = c(NA, -10L))

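The problem is to subset the full data frame to the rows whose strings contain non-letter or non-ASCII characters (for example, from the values above: '{}', '()', '&', '�').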

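The desired output is the column of names that contain these characters, plus the total row count, so that I can calculate the percentage of 'not complete' or 'accurate' names in the full data frame.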

Not Complete Full name
----------------------

Hermsdorfer, Mark (retired)
CSP, PSECU Lan Unit (typo)
G�mez, Oscar Prieto
Sj�¶strand, Anders
Meloy, Wilson {old}

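We can use str_detect. The character class [^A-Za-z, ] matches any character that is not an ASCII letter, a comma, or a space, so filter() keeps only the rows that contain at least one such character.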

library(dplyr)
library(stringr)
df %>% 
   filter(str_detect(Full_name, "[^A-Za-z, ]+"))
                    Full_name Group
1 Hermsdorfer, Mark (retired)     b
2  CSP, PSECU Lan Unit (typo)     c
3         G�mez, Oscar Prieto     e
4        Sj�¶strand, Anders     f
5           Meloy, Will {old}     h

For a more inclusive notion of what counts as a letter, I borrowed the regex from this question about matching letters.

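library(dplyr)
df %>% mutate(
  # \\p{L} matches any Unicode letter; perl = TRUE enables PCRE syntax.
  # The backslash must be doubled inside an R string literal.
  has_non_letters = grepl("[^\\p{L} ]", names, perl = TRUE)
)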
#                          names has_non_letters
# 1                 Mikki Clancy           FALSE
# 2  Hermsdorfer, Mark (retired)            TRUE
# 3   CSP, PSECU Lan Unit (typo)            TRUE
# 4               Clifton Gurlen           FALSE
# 5   G<U+FFFD>mez, Oscar Prieto            TRUE
# 6         Sj�¶strand, Anders            TRUE
# 7                   Lisa Terry           FALSE
# 8          Meloy, Wilson {old}            TRUE
# 9              Gregory Stevens           FALSE
# 10           Charles Gruenberg           FALSE
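
To make the difference between the two character classes concrete, here is a minimal sketch on a hypothetical vector (not the original data). It assumes a PCRE build with Unicode property support, and it allows commas in both patterns so the two can be compared side by side, which differs from the pattern used in the answer above.

```r
# Hypothetical sample values, written with escapes so the file stays ASCII:
# a plain name, a correctly encoded accented name, a name containing the
# U+FFFD replacement character, and a name with parentheses.
x <- c("Lisa Terry", "G\u00f3mez, Oscar", "G\ufffdmez, Oscar", "Mark (retired)")

# ASCII-only class: anything other than A-Z, a-z, comma, or space is flagged.
ascii_flag <- grepl("[^A-Za-z, ]", x)

# Unicode-aware class: \p{L} matches a letter in any script (needs perl = TRUE).
unicode_flag <- grepl("[^\\p{L}, ]", x, perl = TRUE)

data.frame(x, ascii_flag, unicode_flag)
```

The correctly encoded accented name should be flagged only by the ASCII class, while the replacement character and the parentheses should be flagged by both. Which behavior is preferable depends on whether accented names count as 'complete' in your data.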

I'll leave the extra summarizing to you: you can sum or mean the TRUE/FALSE values, whichever you prefer.
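
As a rough sketch of that summary step, assuming a shortened, hypothetical version of the data and the same pattern as above: `sum()` on a logical vector counts the TRUE values and `mean()` gives their proportion.

```r
# Shortened, hypothetical version of the example data (ASCII only).
df <- data.frame(names = c(
  "Mikki Clancy",
  "Hermsdorfer, Mark (retired)",
  "Lisa Terry",
  "Meloy, Wilson {old}"
))

# TRUE where the name contains anything other than a letter or a space.
has_non_letters <- grepl("[^\\p{L} ]", df$names, perl = TRUE)

n_flagged   <- sum(has_non_letters)         # number of flagged rows
pct_flagged <- 100 * mean(has_non_letters)  # flagged rows as a percentage

# The flagged names themselves.
df$names[has_non_letters]
```

Here two of the four names are flagged, so `n_flagged` is 2 and `pct_flagged` is 50.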


Using this data:

df = data.frame(names = c(
"Mikki Clancy",
"Hermsdorfer, Mark (retired)",
"CSP, PSECU Lan Unit (typo)",
"Clifton Gurlen",
"G�mez, Oscar Prieto",
"Sj�¶strand, Anders",
"Lisa Terry",
"Meloy, Wilson {old}",
"Gregory Stevens",
"Charles Gruenberg"
))