Identifying Unicode replacement characters (U+FFFD or � or black diamond question mark) in R

I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with �. I'm trying to flag text entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "n�o". I can see in the tests below that gsub() and grepl() can identify "�" in both a list and a data frame, but when I try to use the same commands on the real data, both commands fail to identify "n�o" and even "�". There is no error; gsub() simply fails to substitute, and grepl() returns FALSE when it should be TRUE.

Are there multiple variants of � depending on the underlying character? Is there any way to search for or replace � characters that will catch every instance?
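For reference, U+FFFD is a single, well-defined code point, so it can also be targeted by its escape sequence rather than by pasting the glyph. A minimal sketch, assuming the strings really are UTF-8:

```r
# U+FFFD can be written as "\uFFFD" in an R string literal,
# which avoids any copy/paste or source-encoding issues
x <- c("n\uFFFDo", "não", "nao")
grepl("\uFFFD", x)      # TRUE FALSE FALSE
gsub("\uFFFD", "ã", x)  # "não" "não" "nao"
```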

This example shows that gsub() and grepl() both work as expected on a list or data frame:

list <- c("n�o ç não", "n�o", "nao", "não")
gsub("�", "ã", list)
grepl("�", list)

library(dplyr)
df <- data.frame(list)
df.new <- df %>%
  mutate(
    sub = gsub("�", "ã", list),
    replace = grepl("�", list))
df.new$sub
df.new$replace

[1] "não ç não" "não"       "nao"       "não"
[1]  TRUE  TRUE FALSE FALSE

[1] "não ç não" "não"       "nao"       "não"
[1]  TRUE  TRUE FALSE FALSE

The same code fails to identify "�" in my real data.
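One way to see why the real data behaves differently is to inspect the code points directly; if the offending characters are not literally U+FFFD, a pattern pasted from elsewhere will never match. A small diagnostic sketch (the value below is a stand-in for one entry from the real column):

```r
# utf8ToInt() exposes the code points, so you can check whether
# 65533 (U+FFFD) is actually present in the string
v <- "n\uFFFDo"
utf8ToInt(v)   # 110 65533 111
Encoding(v)    # the encoding R has declared for the string
```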

I'm guessing you're on a Windows machine, which sometimes doesn't handle Unicode characters well. To recreate the problem, I'm parsing your actual post to show you what you can do. I'd suggest the stringi library; as a shortcut I'm replacing every character you know should be ã, but really you'd want a comprehensive solution that handles every possible case. See ?stringi-search-charclass for more information on how to do that, but.. from your original post:

I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with �. I'm trying to flag text entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "n�o". I can see in tests below that gsub() and grepl() can identify "�" in both a list or data frame, but when I try to use the same commands on the real data, both commands fail to identify "n�o" and even "�". There is no error; it just fails to substitute for gsub() and marks FALSE when it should be TRUE for grepl().

我们得到:

library(xml2)
library(stringi)
this_post = "
read_html(this_post) %>% 
    xml_find_all('//*[@id="question"]/div/div[2]/div[1]/p[1]') %>% 
        xml_text() %>% stri_replace_all_regex("\\p{So}", "ã")

I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with ã. I'm trying to flag text entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "não". I can see in tests below that gsub() and grepl() can identify "ã" in both a list or data frame, but when I try to use the same commands on the real data, both commands fail to identify "não" and even "ã". There is no error; it just fails to substitute for gsub() and marks FALSE when it should be TRUE for grepl().

For your original data... see if this works:

stringi::stri_escape_unicode(orig_data)  %>% 
    stringi::stri_replace_all_regex("\\p{So}", "ã")

One more thing

You can't grepl for an unknown character, because the function doesn't know what you're asking it to match. Instead, try this:

stringi::stri_unescape_unicode("\u00e3")
[1] "ã"
grepl("\\\\u00e3", stringi::stri_escape_unicode(orig_data), perl = TRUE)
[1]  TRUE FALSE FALSE  TRUE

Edit based on the comments:

The solution below is better, because the "question mark" characters you're getting may otherwise be lost to ASCII conversion. Note that in the example I gave, you simply replace ANY/ALL bad characters with "ã". Obviously that isn't a sound approach in general, but if you read the help docs I'm sure you'll see how to combine this approach with escaping to handle all of your strings.

orig_data$repaired_text <- stringi::stri_enc_toutf8(orig_data$text) %>% 
    stringi::stri_replace_all_regex("\\p{So}", "ã")
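For completeness, here is what the \p{So} class is doing in isolation: U+FFFD has Unicode general category So (Symbol, other), so a regex with that class catches it, though it would also catch any other So symbol in the text. A minimal sketch:

```r
library(stringi)

# U+FFFD belongs to Unicode general category So, so \p{So} matches it;
# note the doubled backslash needed inside an R string literal
x <- c("n\uFFFDo", "nao")
stri_replace_all_regex(x, "\\p{So}", "ã")  # "não" "nao"
```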